Update README.md

---

# grammarBERT

`grammarBERT` is `codeBERT` fine-tuned with a Masked Language Modeling (MLM) objective on derivation sequences for Python 3.8. By fine-tuning on Python's Abstract Syntax Tree (AST) structures, `grammarBERT` combines `codeBERT`'s handling of natural language and code tokens with a dedicated focus on derivation sequences, improving performance on grammar-based programming tasks. It is particularly useful for applications that require syntactic understanding, accurate parsing, or context-aware code generation and transformation.

## Model Overview

- **Base Model**: `codeBERT`
- **Task**: Masked Language Modeling on derivation sequences
- **Supported Language**: Python 3.8
- **Applications**: Parsing, code transformation, syntactic analysis, grammar-based programming

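To make "derivation sequence" concrete before getting to usage, here is a minimal sketch using only the standard `ast` module. The `toy_ast2seq` helper is hypothetical: it merely flattens an AST into node-type names, whereas the real `ast2seq` shipped in the grammarBERT repository emits actual grammar derivation steps.

```python
import ast

def toy_ast2seq(code):
    # Illustration only: flatten a snippet's AST into a sequence of
    # node-type names. The real `ast2seq` (see the grammarBERT repo)
    # produces grammar (ASDL) derivation steps instead.
    tree = ast.parse(code)
    return [type(node).__name__ for node in ast.walk(tree)]

print(toy_ast2seq("def enumerate_items(items):\n    pass"))
# -> ['Module', 'FunctionDef', 'arguments', 'Pass', 'arg']
```
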
## Model Usage

To use `grammarBERT` with Python 3.8-specific derivation sequences, load the model and tokenizer as shown below:

```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

model = RobertaForMaskedLM.from_pretrained("Nbeau/grammarBERT")
tokenizer = RobertaTokenizer.from_pretrained("Nbeau/grammarBERT")

# Prepare a code snippet
code_snippet = "def enumerate_items(items):"

# Convert the code to a derivation sequence using the `ast2seq` function
# from https://github.com/NathanaelBeau/grammarBERT/asdl/
derivation_sequence = ast2seq(code_snippet)
input_ids = tokenizer.encode(derivation_sequence, return_tensors='pt')

# Use the model for masked token prediction or further fine-tuning
outputs = model(input_ids)
```
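
As a concrete follow-up, here is a minimal sketch of masked-token prediction with the objects defined above; the masked position (index 3) is an arbitrary choice for illustration:

```python
import torch

# Mask one (arbitrary) position in the encoded derivation sequence
masked_ids = input_ids.clone()
masked_ids[0, 3] = tokenizer.mask_token_id

# Predict the most likely token at the masked position
with torch.no_grad():
    logits = model(masked_ids).logits
predicted_id = logits[0, 3].argmax(-1).item()
print(tokenizer.decode([predicted_id]))
```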

### Training and Fine-Tuning

To train your own `grammarBERT` on a custom dataset or adapt it for different Python versions, follow the setup instructions in the [grammarBERT GitHub repository](https://github.com/NathanaelBeau/grammarBERT). The repository provides detailed guidance for:

- Preparing Python Abstract Syntax Tree (AST) sequences.
- Configuring tokenization for derivation sequences.
- Running training scripts for Masked Language Modeling (MLM) fine-tuning.

This setup allows for targeted fine-tuning on derivation sequences tailored to your specific grammar requirements.
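
Purely as an illustration of the last step, an MLM fine-tuning loop with the Hugging Face `Trainer` might look like the sketch below; the one-example corpus and all hyperparameters are placeholders, and the repository's own scripts remain the reference:

```python
from datasets import Dataset
from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                          RobertaTokenizer, Trainer, TrainingArguments)

tokenizer = RobertaTokenizer.from_pretrained("Nbeau/grammarBERT")
model = RobertaForMaskedLM.from_pretrained("Nbeau/grammarBERT")

# Placeholder corpus: in practice, load your own derivation sequences
sequences = ["Module FunctionDef arguments arg Pass"]
train_dataset = Dataset.from_dict(dict(tokenizer(sequences, truncation=True)))

# Randomly masks 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="grammarbert-mlm", num_train_epochs=3),
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```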