---
license: gpl-3.0
language:
- es
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- classification
- pytorch
- language varieties
---

## Model Description

- **Model type:** Multi-class classifier on top of a transformer
- **Language(s) (NLP):** Spanish varieties: Argentinian (ar), Chilean (cl), Mexican (mx), Spanish from Spain (es), and the remaining varieties (mix)
- **License:** GPL-3.0
- **Finetuned from:** XLM-RoBERTa large
- **Preprocessing and tokenisation:** the same as XLM-RoBERTa

We provide models for a 3-class problem (es, mx, mix), a 4-class problem (cl, es, mx, mix), and a 5-class problem (ar, cl, es, mx, mix). For each setting, we include models trained with three different seeds, each in a version with one split and a version with two splits of the training documents.

See the documentation of [docTransformer](https://github.com/cristinae/docTransformer) for more detailed information.
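
The combinations described above (three class settings, three seeds, one or two splits) can be enumerated with a short sketch. The label sets come from this model card; the seed numbering and the `checkpoint_name` helper are assumptions, with the file name pattern inferred from the two checkpoint names that appear in the usage commands (`C4_cereal2splits_seed1.bin` and `C4_cereal1split_seed1.bin`). Check the repository for the actual file names.

```python
from itertools import product

# Label sets per setting, as listed in this model card.
# The order follows the card, not necessarily the classifier's output indices.
LABELS = {
    3: ["es", "mx", "mix"],
    4: ["cl", "es", "mx", "mix"],
    5: ["ar", "cl", "es", "mx", "mix"],
}
SEEDS = [1, 2, 3]   # assumption: the three seeds are numbered 1 to 3
SPLITS = [1, 2]     # one or two splits of the training documents


def checkpoint_name(n_classes, n_splits, seed):
    """Hypothetical file name, following the pattern seen in the usage commands."""
    split_tag = "1split" if n_splits == 1 else f"{n_splits}splits"
    return f"C{n_classes}_cereal{split_tag}_seed{seed}.bin"


# Every (classes, splits, seed) combination: 3 * 2 * 3 variants.
variants = list(product(LABELS, SPLITS, SEEDS))
print(len(variants))  # 18

for n_classes, n_splits, seed in variants:
    print(checkpoint_name(n_classes, n_splits, seed), LABELS[n_classes])
```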

## Model Sources

- **Repository:** https://github.com/CEREAL-es/CEREAL
- **Paper:** [Elote, Choclo and Mazorca: on the Varieties of Spanish](https://aclanthology.org/2024.naacl-long.204.pdf) (NAACL 2024)
- **Data:** The corpora are available on [Zenodo](https://zenodo.org/records/11390829)

## Use

Use the CEREAL classification models with [docTransformer](https://github.com/cristinae/docTransformer).

## Example Usage

The models can be used for evaluation, classification, or explanation with integrated gradients:

### Slurm

#### Evaluation (gold label available)

```
srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py --task evaluation -f trainedModel -o C4_cereal2splits_seed1.bin -b2 --sentence_batch_size 2 --split_documents True --test_dataset data/multivariant3all.test --plotConfusionFileName modelSplit2Seed3test.png
```

#### Classification (gold label unavailable)

```
srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py --task classification -f trainedModel -o C4_cereal2splits_seed1.bin -b1 --sentence_batch_size 2 --split_documents True --test_dataset ../es/es_meta_part_1.jsonl.unk
```
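
When these commands are launched from a script, it can help to assemble the argument list in Python. The sketch below rebuilds the classification command using only the flags shown above. The `build_command` helper is hypothetical and not part of docTransformer. The variant without the `srun` prefix assumes that `docClassifier.py` also runs outside Slurm, which this model card does not state.

```python
import shlex

# Slurm launcher prefix, copied from the commands in this model card.
SRUN_PREFIX = ["srun", "--ntasks", "1", "--gpus-per-task", "1"]


def build_command(task, task_args, use_srun=True):
    """Hypothetical helper: assemble a docClassifier.py call as an argument list."""
    cmd = ["python", "-u", "docClassifier.py", "--task", task] + list(task_args)
    return SRUN_PREFIX + cmd if use_srun else cmd


# Arguments taken verbatim from the classification example.
classification_args = [
    "-f", "trainedModel",
    "-o", "C4_cereal2splits_seed1.bin",
    "-b1",
    "--sentence_batch_size", "2",
    "--split_documents", "True",
    "--test_dataset", "../es/es_meta_part_1.jsonl.unk",
]

slurm_cmd = build_command("classification", classification_args)
# Assumption: the same call works without the Slurm prefix.
local_cmd = build_command("classification", classification_args, use_srun=False)

print(shlex.join(slurm_cmd))
print(shlex.join(local_cmd))
```

The resulting list can be passed to `subprocess.run(slurm_cmd, check=True)`. The evaluation and explanation commands follow the same shape with their own arguments.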

#### Explanation

```
srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py --task explanation -t data/testExample.mx -f trainedModel -o C4_cereal1split_seed1.bin -b1 --split_documents False --xai_threshold_percentile 90
```
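
For readers unfamiliar with integrated gradients, here is a minimal, self-contained sketch of the general technique on a toy function. It does not reproduce docTransformer's implementation or the behaviour of `--xai_threshold_percentile`. The toy model, its gradient, and the all-zero baseline are assumptions chosen for illustration. The method attributes to each input feature the difference from a baseline, multiplied by the average gradient along the straight path from the baseline to the input.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum."""
    avg_grads = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grads[i] += g / steps
    # Scale the averaged gradient by the distance from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grads)]


# Toy stand-in for a model: F(x) = x0^2 + 3 * x1, with its analytic gradient.
def toy_model(x):
    return x[0] ** 2 + 3.0 * x[1]


def toy_gradient(x):
    return [2.0 * x[0], 3.0]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(toy_gradient, x, baseline)

# Completeness check: attributions sum to F(x) - F(baseline).
total = sum(attributions)
expected_total = toy_model(x) - toy_model(baseline)
print(attributions, total, expected_total)
```

In a text classifier the inputs would be token embeddings rather than two scalars, and the attributions would typically be aggregated per token.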

## Citation

**BibTeX:**

```
@inproceedings{espana-bonet-barron-cedeno-2024-elote,
    title = "Elote, Choclo and Mazorca: on the Varieties of {S}panish",
    author = "Espa{\~n}a-Bonet, Cristina and
      Barr{\'o}n-Cede{\~n}o, Alberto",
    editor = "Duh, Kevin and
      Gomez, Helena and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.204",
    pages = "3689--3711"
}
```

**APA:**

España-Bonet, C., & Barrón-Cedeño, A. (2024, June). Elote, Choclo and Mazorca: on the Varieties of Spanish. In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)* (pp. 3689-3711). Association for Computational Linguistics.