---
datasets:
- bigcode/the-stack-v2
base_model:
- microsoft/codebert-base
---

# grammarBERT

`grammarBERT` fine-tunes `codeBERT` with a Masked Language Modeling (MLM) objective on derivation sequences for Python 3.8. The resulting model combines `codeBERT`'s strengths on natural language and code tokens with a specialized ability to represent and retrieve derivation sequences. This is useful for grammar-based programming tasks, improving both parsing accuracy and downstream model applications.

## Usage

```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

# Load the pre-trained codeBERT model and tokenizer as the starting point
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Convert a code snippet to its derivation sequence
# (ast2seq implementation available at https://github.com/NathanaelBeau/grammarBERT/)
code_snippet = "def enumerate_items(items):"
derivation_sequence = ast2seq(code_snippet)

# Tokenize the derivation sequence (assumed here to be a string of rule tokens)
# rather than the raw code snippet
input_ids = tokenizer.encode(derivation_sequence, return_tensors='pt')

# Predict masked tokens or fine-tune the model as needed
outputs = model(input_ids)
```
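Below is a minimal sketch of the masked-token prediction mentioned in the snippet above. The rule tokens and the checkpoint name are illustrative only; in practice the derivation vocabulary comes from `ast2seq` and the released grammarBERT weights should be loaded instead of the base checkpoint.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

# Illustrative checkpoint; substitute the actual grammarBERT checkpoint
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# A toy derivation sequence with one rule token masked out
# (rule names are placeholders, not the exact ast2seq vocabulary)
sequence = f"Module FunctionDef arguments {tokenizer.mask_token} Return"

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Take the most likely token at each masked position
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_ids = logits[mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```

The same tokenized derivation sequences can be fed to a standard `transformers` MLM fine-tuning loop (e.g. with `DataCollatorForLanguageModeling`) to adapt the model further to a specific grammar or corpus.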