---
license: gpl-3.0
language:
- es
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- classification
- pytorch
- language varieties
---

## Model Description

- **Model type:** Multi-class classifier on top of a transformer
- **Language(s) (NLP):** Spanish varieties: Argentinian (ar), Chilean (cl), Mexican (mx), Spanish (es), and the rest (mix)
- **License:** GPL-3.0
- **Finetuned from:** XLM-RoBERTa large
- **Preprocessing and tokenisation:** the same as XLM-RoBERTa

We provide models for three problems: 3-class (es, mx, mix), 4-class (cl, es, mx, mix) and 5-class (ar, cl, es, mx, mix). For each problem, we include models trained with 3 different seeds, in versions with one and with two splits of the training documents. See the documentation of [docTransformer](https://github.com/cristinae/docTransformer) for more detailed information.

## Model Sources

- **Repository:** https://github.com/CEREAL-es/CEREAL
- **Paper:** [Elote, Choclo and Mazorca: on the Varieties of Spanish](https://aclanthology.org/2024.naacl-long.204.pdf) (NAACL 2024)
- **Data:** The corpora are available at [Zenodo](https://zenodo.org/records/11390829)

## Use

Use the CEREAL classification models with [docTransformer](https://github.com/cristinae/docTransformer).
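
To make the set of released variants concrete, the following minimal sketch enumerates the combinations described in the model description (three label sets, three seeds, one or two splits). The label sets come from this card. The file-name pattern is an assumption extrapolated from the two checkpoint names used in the usage examples (`C4_cereal2splits_seed1.bin` and `C4_cereal1split_seed1.bin`), and the seed identifiers 1 to 3 are likewise assumed; check the repository for the actual file names. The helper `checkpoint_name` is hypothetical and not part of docTransformer.

```python
from itertools import product

# Label sets per problem, as listed in the model description.
LABELS = {
    3: ["es", "mx", "mix"],
    4: ["cl", "es", "mx", "mix"],
    5: ["ar", "cl", "es", "mx", "mix"],
}
SEEDS = (1, 2, 3)   # ASSUMPTION: the three seeds are numbered 1, 2, 3
SPLITS = (1, 2)     # one or two splits of the training documents


def checkpoint_name(n_classes: int, splits: int, seed: int) -> str:
    """Build a checkpoint file name (hypothetical helper).

    ASSUMPTION: the pattern is extrapolated from the names
    'C4_cereal2splits_seed1.bin' and 'C4_cereal1split_seed1.bin'.
    """
    suffix = "split" if splits == 1 else "splits"
    return f"C{n_classes}_cereal{splits}{suffix}_seed{seed}.bin"


# All combinations: 3 label sets x 2 split settings x 3 seeds.
variants = [
    checkpoint_name(n_classes, splits, seed)
    for n_classes, splits, seed in product(LABELS, SPLITS, SEEDS)
]

for name in variants:
    print(name)
print(len(variants))  # 18
```

Under these assumptions, the value passed to `-o` in the commands of the next section would be one of these names, and `LABELS[n]` gives the classes the corresponding model can predict.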

## Example Usage

The models can be used for evaluation, classification, or explanation with integrated gradients. The commands below launch `docClassifier.py` through Slurm's `srun`.

### Slurm

#### Evaluation (gold label available)

```
srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py \
    --task evaluation \
    -f trainedModel -o C4_cereal2splits_seed1.bin \
    -b2 --sentence_batch_size 2 --split_documents True \
    --test_dataset data/multivariant3all.test \
    --plotConfusionFileName modelSplit2Seed3test.png
```

#### Classification (gold label unavailable)

```
srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py \
    --task classification \
    -f trainedModel -o C4_cereal2splits_seed1.bin \
    -b1 --sentence_batch_size 2 --split_documents True \
    --test_dataset ../es/es_meta_part_1.jsonl.unk
```

#### Explanation

```
srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py \
    --task explanation \
    -t data/testExample.mx \
    -f trainedModel -o C4_cereal1split_seed1.bin \
    -b1 --split_documents False \
    --xai_threshold_percentile 90
```

## Citation

**BibTeX:**

```
@inproceedings{espana-bonet-barron-cedeno-2024-elote,
    title = "Elote, Choclo and Mazorca: on the Varieties of {S}panish",
    author = "Espa{\~n}a-Bonet, Cristina and Barr{\'o}n-Cede{\~n}o, Alberto",
    editor = "Duh, Kevin and Gomez, Helena and Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.204",
    pages = "3689--3711"
}
```

**APA:**

España-Bonet, C., & Barrón-Cedeño, A. (2024, June). Elote, Choclo and Mazorca: on the Varieties of Spanish. In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)* (pp. 3689-3711). Association for Computational Linguistics.