Nbeau committed on
Commit
992962c
1 Parent(s): a1ee9fe

Update README.md

Files changed (1)
  1. README.md +26 -7
README.md CHANGED
@@ -6,11 +6,18 @@ base_model:
  ---
  # grammarBERT

- `grammarBERT` fine-tunes the `codeBERT` model using a Masked Language Modeling (MLM) task on derivation sequences for Python version 3.8. By doing so, the model combines `codeBERT`’s expertise in both natural language and code token tasks to create a more specialized model capable of effectively representing and retrieving derivation sequences. This has applications in grammar-based programming tasks, improving both parsing accuracy and downstream model applications.
+ `grammarBERT` is a specialized fine-tuning of `codeBERT`, using a Masked Language Modeling (MLM) task focused on derivation sequences specific to Python 3.8. By fine-tuning on Python’s Abstract Syntax Tree (AST) structures, `grammarBERT` combines `codeBERT`’s capabilities in natural language and code token handling with a unique focus on derivation sequences, enhancing performance for grammar-based programming tasks. This is particularly useful for applications requiring syntactic understanding, improved parsing accuracy, and context-aware code generation or transformation.
+
+ ## Model Overview
+
+ - **Base Model**: `codeBERT`
+ - **Task**: Masked Language Modeling on derivation sequences
+ - **Supported Language**: Python 3.8
+ - **Applications**: Parsing, code transformation, syntactic analysis, grammar-based programming

- ## Usage
+ ## Model Usage
+
+ To use the `grammarBERT` model with Python 3.8-specific derivation sequences, load the model and tokenizer as shown below:

  ```python
  from transformers import RobertaForMaskedLM, RobertaTokenizer
@@ -19,12 +26,24 @@ from transformers import RobertaForMaskedLM, RobertaTokenizer
  model = RobertaForMaskedLM.from_pretrained("Nbeau/grammarBERT")
  tokenizer = RobertaTokenizer.from_pretrained("Nbeau/grammarBERT")

- # Example of tokenizing a code snippet
+ # Tokenize and prepare a code snippet
  code_snippet = "def enumerate_items(items):"
- derivation_sequence = ast2seq(code_snippet) # ast2seq implementation available https://github.com/NathanaelBeau/grammarBERT/asdl/
- input_ids = tokenizer.encode(code_snippet, return_tensors='pt')
+ # Convert code to a derivation sequence (requires `ast2seq` function)
+ derivation_sequence = ast2seq(code_snippet) # `ast2seq` available at https://github.com/NathanaelBeau/grammarBERT/asdl/
+ input_ids = tokenizer.encode(derivation_sequence, return_tensors='pt')

- # Predict masked tokens or fine-tune the model as needed
+ # Use the model for masked token prediction or further fine-tuning
  outputs = model(input_ids)
- ```
+ ```
+
+ ### Training and Fine-Tuning
+
+ To train your own `grammarBERT` on a custom dataset or adapt it for different Python versions, follow the setup instructions in the [grammarBERT GitHub repository](https://github.com/NathanaelBeau/grammarBERT). The repository provides detailed guidance for:
+
+ - Preparing Python Abstract Syntax Tree (AST) sequences.
+ - Configuring tokenization for derivation sequences.
+ - Running training scripts for Masked Language Modeling (MLM) fine-tuning.
+
+ This setup allows for targeted fine-tuning on derivation sequences tailored to your specific grammar requirements.
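For intuition about the derivation sequences mentioned above: they linearize the grammar rules used to derive a program rather than its surface tokens. The repository's `ast2seq` (built on the Python 3.8 ASDL grammar) produces the actual sequences; the sketch below is only a rough approximation that records node types from Python's built-in `ast` module, and is not the repository's method.

```python
import ast

def rough_derivation_sequence(code: str):
    """Rough stand-in for ast2seq: record the AST node types of a program.

    The real ast2seq maps nodes to Python 3.8 ASDL grammar rules; this only
    illustrates the idea of linearizing a syntax tree into a sequence.
    """
    tree = ast.parse(code)
    return [type(node).__name__ for node in ast.walk(tree)]

# The README's snippet needs a body to parse, so `pass` is added here.
print(rough_derivation_sequence("def enumerate_items(items):\n    pass"))
# ['Module', 'FunctionDef', 'arguments', 'Pass', 'arg']
```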
48
+
49