---
datasets:
- bigcode/the-stack-v2
base_model:
- microsoft/codebert-base
---
# grammarBERT


`grammarBERT` fine-tunes `codeBERT` with a Masked Language Modeling (MLM) objective on derivation sequences for Python 3.8. This combines `codeBERT`'s strengths on natural-language and code-token tasks with a specialization for representing and retrieving derivation sequences. The result is useful for grammar-based programming tasks, improving both parsing accuracy and downstream models that consume derivation sequences.
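
A rough sketch of this fine-tuning setup is shown below. It is not the exact training script: `derivation_sequences` is an illustrative placeholder for the extracted Python 3.8 derivation sequences, and hyperparameters are arbitrary.

```python
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from codeBERT and continue pre-training with MLM on derivation sequences
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Placeholder data: in practice these are derivation sequences extracted with ast2seq
derivation_sequences = ["Module FunctionDef arguments arg Return"]
train_dataset = [tokenizer(seq, truncation=True) for seq in derivation_sequences]

# Randomly mask 15% of tokens, as in standard MLM pre-training
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="grammarBERT", num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```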


## Usage 

```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

# Load the pre-trained codeBERT model and tokenizer
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Example of converting a code snippet into a derivation sequence
code_snippet = "def enumerate_items(items):"
# ast2seq is provided by the grammarBERT repository (not on PyPI):
# https://github.com/NathanaelBeau/grammarBERT/
derivation_sequence = ast2seq(code_snippet)
input_ids = tokenizer.encode(derivation_sequence, return_tensors='pt')

# Predict masked tokens or fine-tune the model as needed
outputs = model(input_ids)

```
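
Continuing from the snippet above, here is a minimal sketch of masked-token prediction; the rule names in `masked_sequence` are illustrative placeholders rather than actual derivation-sequence vocabulary:

```python
import torch

# Illustrative only: mask one position in a (placeholder) derivation sequence
masked_sequence = f"FunctionDef {tokenizer.mask_token} arguments"
inputs = tokenizer(masked_sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and decode its most likely replacement
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```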