This model has been trained on 80% of the COWS-L2H dataset plus 80,984 SYNTHETICALLY GENERATED errorful sentences for grammatical error correction of Spanish text. Because the corpus was sentencized, the model is fine-tuned for SENTENCE CORRECTION and will likely not perform well on an entire paragraph. To correct a paragraph, sentencize the text and run the model on each sentence, as sketched below.
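For example, here is a minimal sketch of paragraph-level correction. The regex splitter and the correct_paragraph helper are illustrative assumptions, not part of this repo (a dedicated sentencizer such as spaCy's Spanish pipeline would be more robust), and the sketch assumes the tokenizer and model are loaded as in the usage example below:

import re

def correct_paragraph(paragraph, tokenizer, model):
    # Naive sentencizer: split on sentence-final punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    inputs = tokenizer(sentences, max_length=128, padding=True, truncation=True, return_tensors="pt")
    outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
    # Rejoin the corrected sentences into a single paragraph
    return " ".join(tokenizer.batch_decode(outputs, skip_special_tokens=True))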
The synthetic data was generated from well-formed Spanish sentences using a rule-based algorithm. The code for synthetic generation is available in the GitHub repo for this project: https://github.com/SkitCon/synth_gec_es. A toy illustration of the idea follows.
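The rules below are invented for illustration only and are NOT the actual rules from synth_gec_es; see the repo above for the real algorithm:

import random

# Hypothetical corruption rules (well-formed token -> errorful token)
RULES = {
    "la": "el",      # article gender-agreement error
    "el": "la",
    "voy": "va",     # subject-verb agreement error
    "está": "esta",  # dropped accent
}

def corrupt(sentence, p=0.5):
    """Randomly apply corruption rules to produce an errorful sentence."""
    tokens = sentence.split()
    return " ".join(RULES[t] if t in RULES and random.random() < p else t for t in tokens)

# corrupt("Yo voy a la tienda.") may yield "Yo va a el tienda."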
BLEU: 0.851 on COWS-L2H
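A corpus BLEU score in this range can be checked with, for instance, sacrebleu (an assumed evaluation tool; the authors' exact scoring script is not shown here):

import sacrebleu

hypotheses = ["Yo voy a la tienda."]    # model corrections
references = [["Yo voy a la tienda."]]  # one parallel list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score / 100)  # sacrebleu reports 0-100; dividing gives the 0-1 scale used above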
Example usage:
from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("SkitCon/gec-spanish-BARTO-SYNTHETIC")
model = BartForConditionalGeneration.from_pretrained("SkitCon/gec-spanish-BARTO-SYNTHETIC")

input_sentences = ["Yo va al tienda.", "Espero que tú ganas."]

# Dynamic padding (padding=True) pads only to the longest sentence in the batch
tokenized_text = tokenizer(input_sentences, max_length=128, padding=True, truncation=True, return_tensors="pt")

# Note: do not .squeeze() the tensors -- with a single input sentence that
# would drop the batch dimension and break generate()
outputs = model.generate(input_ids=tokenized_text["input_ids"], attention_mask=tokenized_text["attention_mask"])

for sentence in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(sentence)
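For these two inputs, the intended corrections are "Yo voy a la tienda." and "Espero que tú ganes." (exact model output may vary).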
Base model: vgaraujov/bart-base-spanish