---
datasets:
- bigcode/the-stack-v2
base_model:
- microsoft/codebert-base
---
# grammarBERT
`grammarBERT` fine-tunes the `codeBERT` model with a Masked Language Modeling (MLM) objective on derivation sequences for Python 3.8. The resulting model combines `codeBERT`'s strengths on natural language and code tokens with a specialization for representing and retrieving derivation sequences. This is useful for grammar-based programming tasks, improving both parsing accuracy and downstream model applications.
## Usage
```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

# Load the pre-trained codeBERT model and tokenizer
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Convert a code snippet into its derivation sequence
# (ast2seq implementation available at https://github.com/NathanaelBeau/grammarBERT/)
code_snippet = "def enumerate_items(items):"
derivation_sequence = ast2seq(code_snippet)

# Tokenize the derivation sequence
input_ids = tokenizer.encode(derivation_sequence, return_tensors="pt")

# Predict masked tokens or fine-tune the model as needed
outputs = model(input_ids)
```
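
For the fine-tuning step itself, the sketch below shows one way to run the MLM objective over derivation sequences with the Hugging Face `Trainer`. It is a minimal illustration, not the exact training setup used for grammarBERT: the example sequences, hyperparameters, and output directory are placeholders, and real training would use derivation sequences produced by `ast2seq` over a large Python corpus.

```python
import torch
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")

# Placeholder derivation sequences; in practice these come from ast2seq
# applied to a Python 3.8 corpus and serialized as strings.
sequences = [
    "Module FunctionDef arguments arg Return BinOp",
    "Module Assign Name Store Call Name Load",
]

class DerivationDataset(torch.utils.data.Dataset):
    """Wraps tokenized derivation sequences for the Trainer."""
    def __init__(self, texts):
        self.encodings = tokenizer(texts, truncation=True, max_length=512)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}

# Standard MLM collator: randomly masks 15% of input tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="grammarBERT-mlm",  # illustrative output path
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=DerivationDataset(sequences),
)
trainer.train()
```

Depending on how `ast2seq` serializes derivation sequences, the grammar rule tokens can also be added to the tokenizer vocabulary (`tokenizer.add_tokens`, followed by `model.resize_token_embeddings`) before fine-tuning.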