---
datasets:
- bigcode/the-stack-v2
base_model:
- microsoft/codebert-base
---
# grammarBERT

grammarBERT fine-tunes the codeBERT model with a Masked Language Modeling (MLM) objective on derivation sequences for Python 3.8. In doing so, it builds on codeBERT's expertise in natural-language and code-token tasks to create a more specialized model that can effectively represent and retrieve derivation sequences. This is useful for grammar-based programming tasks, improving both parsing accuracy and downstream applications.
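The fine-tuning itself follows standard MLM practice. Below is a minimal sketch of how such training can be set up with the Hugging Face `Trainer`; the serialized derivation sequences are hypothetical placeholders (the actual training data is presumably produced by `ast2seq` over the corpus), and the hyperparameters are illustrative only.

```python
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Hypothetical derivation sequences serialized as whitespace-separated rule tokens;
# the real format produced by ast2seq may differ.
derivation_corpus = [
    "Module -> FunctionDef FunctionDef -> arguments Return",
    "Module -> Assign Assign -> Name Call",
]

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = Dataset.from_dict({"text": derivation_corpus}).map(
    tokenize, batched=True, remove_columns=["text"]
)

# Randomly mask 15% of tokens, as in standard MLM pre-training
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="grammarbert-mlm",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```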
## Usage
```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

# Load the pre-trained codeBERT model and tokenizer
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Convert a code snippet into its derivation sequence
# (ast2seq implementation available at https://github.com/NathanaelBeau/grammarBERT/)
code_snippet = "def enumerate_items(items):"
derivation_sequence = ast2seq(code_snippet)

# Tokenize the derivation sequence (exact serialization depends on the ast2seq output)
input_ids = tokenizer.encode(derivation_sequence, return_tensors='pt')

# Predict masked tokens or fine-tune the model as needed
outputs = model(input_ids)
```
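As a follow-up, here is a self-contained sketch of inspecting the model's predictions at a masked position; the input string is a plain code fragment used purely for illustration rather than a real derivation sequence.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")

# Score the tokenizer's <mask> placeholder inside an input string
masked = tokenizer("def <mask>(items):", return_tensors="pt")
with torch.no_grad():
    logits = model(**masked).logits

# Top-5 candidate tokens for the masked position
mask_pos = (masked["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```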