---
datasets:
  - bigcode/the-stack-v2
base_model:
  - microsoft/codebert-base
---

# grammarBERT

grammarBERT fine-tunes the codeBERT model with a Masked Language Modeling (MLM) objective on derivation sequences for Python 3.8. The resulting model combines codeBERT's strengths on natural-language and code-token tasks with a more specialized ability to represent and recover derivation sequences. This is useful for grammar-based programming tasks, improving both parsing accuracy and downstream model applications.
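As a rough sketch of how such MLM fine-tuning can be set up (the toy derivation sequences, rule names, and training hyperparameters below are illustrative assumptions, not the exact training script):

```python
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")

# Toy derivation sequences; in practice these would be produced by ast2seq
# over a large Python corpus such as the Python subset of the-stack-v2.
sequences = [
    "Module FunctionDef arguments arg Return BinOp Name Add Name",
    "Module Assign Name Call Name Load",
]
dataset = Dataset.from_dict({"text": sequences}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

# Randomly mask 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="grammarbert-mlm", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```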

## Usage

```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

# Load the pre-trained codeBERT model and tokenizer
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Convert a code snippet into its derivation sequence
# (ast2seq implementation available at https://github.com/NathanaelBeau/grammarBERT/)
code_snippet = "def enumerate_items(items):\n    return list(enumerate(items))"
derivation_sequence = ast2seq(code_snippet)

# Encode the derivation sequence for the model
input_ids = tokenizer.encode(derivation_sequence, return_tensors='pt')

# Predict masked tokens or fine-tune the model as needed
outputs = model(input_ids)
```
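For masked-token prediction specifically, a minimal sketch (assuming the derivation sequence is passed to the tokenizer as a string, and with an arbitrarily chosen mask position) might look like:

```python
import torch

# Mask one position in the encoded sequence and recover the model's top prediction.
# Position 3 is chosen purely for illustration.
masked_ids = input_ids.clone()
masked_ids[0, 3] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(masked_ids).logits

predicted_id = int(logits[0, 3].argmax(dim=-1))
print(tokenizer.decode([predicted_id]))
```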