---
license: gpl-3.0
language:
- es
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- classification
- pytorch
- language varieties
---
## Model Description
- **Model type:** Multi-class classifier on top of a transformer
- **Language(s) (NLP):** Spanish varieties: Argentinian (ar), Chilean (cl), Mexican (mx), Spanish from Spain (es), and a mixed class for the remaining varieties (mix)
- **License:** GPL-3.0
- **Finetuned from:** XLM-RoBERTa large
- **Preprocessing and tokenisation:** the same as XLM-RoBERTa
We provide models for three problems: 3-class (es, mx, mix), 4-class (cl, es, mx, mix) and 5-class (ar, cl, es, mx, mix). For each problem, we include models trained with 3 different seeds, each in two versions: with one split and with two splits of the training documents.
See the documentation of [docTransformer](https://github.com/cristinae/docTransformer) for more detailed information.
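The combinations described above can be enumerated with a short script. This is only a sketch: the file-name pattern is an assumption inferred from the two names that appear in the usage examples (`C4_cereal2splits_seed1.bin` and `C4_cereal1split_seed1.bin`), and the seed numbering 1 to 3 is likewise assumed. Check the repository file list for the actual names.

```python
from itertools import product

# Label sets for the three classification problems described above.
LABEL_SETS = {
    3: ["es", "mx", "mix"],
    4: ["cl", "es", "mx", "mix"],
    5: ["ar", "cl", "es", "mx", "mix"],
}
SEEDS = (1, 2, 3)  # assumption: seeds are numbered 1..3
SPLITS = (1, 2)    # one or two splits of the training documents


def checkpoint_name(n_classes: int, n_splits: int, seed: int) -> str:
    """Hypothetical file name, following the pattern seen in the usage examples."""
    split_tag = "1split" if n_splits == 1 else f"{n_splits}splits"
    return f"C{n_classes}_cereal{split_tag}_seed{seed}.bin"


# 3 problems x 2 split settings x 3 seeds = 18 candidate names.
checkpoints = [
    checkpoint_name(n_classes, n_splits, seed)
    for n_classes, n_splits, seed in product(LABEL_SETS, SPLITS, SEEDS)
]

for name in checkpoints:
    print(name)
```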
## Model Sources
- **Repository:** https://github.com/CEREAL-es/CEREAL
- **Paper:** [Elote, Choclo and Mazorca: on the Varieties of Spanish](https://aclanthology.org/2024.naacl-long.204.pdf) (NAACL 2024)
- **Data:** The corpora are available on [Zenodo](https://zenodo.org/records/11390829)
## Use
Use the CEREAL classification models with [docTransformer](https://github.com/cristinae/docTransformer).
## Example Usage
These models support three tasks: evaluation, classification, and explanation with integrated gradients.
### Slurm
#### Evaluation (gold label available)
```
srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py \
    --task evaluation -f trainedModel -o C4_cereal2splits_seed1.bin \
    -b2 --sentence_batch_size 2 --split_documents True \
    --test_dataset data/multivariant3all.test \
    --plotConfusionFileName modelSplit2Seed3test.png
```
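To make the evaluation outputs concrete, the sketch below computes accuracy (the metric listed in the card metadata) and a confusion matrix for the 4-class label set. It is an independent illustration, not docTransformer's own code; the toy gold and predicted labels are invented for the example.

```python
from collections import Counter

LABELS = ["cl", "es", "mx", "mix"]  # 4-class label set


def confusion_matrix(gold, predicted, labels):
    """Return a matrix whose rows are gold labels and columns are predicted labels."""
    counts = Counter(zip(gold, predicted))
    return [[counts[(g, p)] for p in labels] for g in labels]


def accuracy(gold, predicted):
    """Fraction of documents whose predicted label equals the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy data, invented for illustration only.
gold = ["cl", "es", "mx", "mix", "es", "mx"]
pred = ["cl", "es", "mx", "es", "es", "mix"]

matrix = confusion_matrix(gold, pred, LABELS)
print(f"accuracy = {accuracy(gold, pred):.3f}")
for label, row in zip(LABELS, matrix):
    print(label, row)
```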
#### Classification (gold label unavailable)
```
srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py \
    --task classification -f trainedModel -o C4_cereal2splits_seed1.bin \
    -b1 --sentence_batch_size 2 --split_documents True \
    --test_dataset ../es/es_meta_part_1.jsonl.unk
```
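When scripting many runs, it can help to assemble the command line programmatically. The helper below is a hypothetical convenience (`build_command` is not part of docTransformer); it uses only the flags shown in the classification command above, drops the `srun` prefix, and assumes the script is launched from the directory containing `docClassifier.py`.

```python
import shlex


def build_command(task, model_file, test_dataset, batch_size=1,
                  sentence_batch_size=2, split_documents=True):
    """Assemble the argument list for docClassifier.py (hypothetical helper)."""
    return [
        "python", "-u", "docClassifier.py",
        "--task", task,
        "-f", "trainedModel",
        "-o", model_file,
        f"-b{batch_size}",
        "--sentence_batch_size", str(sentence_batch_size),
        "--split_documents", str(split_documents),
        "--test_dataset", test_dataset,
    ]


cmd = build_command(
    task="classification",
    model_file="C4_cereal2splits_seed1.bin",
    test_dataset="../es/es_meta_part_1.jsonl.unk",
)
# Print the command for inspection; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```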
#### Explanation
```
srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py \
    --task explanation -t data/testExample.mx -f trainedModel \
    -o C4_cereal1split_seed1.bin -b1 --split_documents False \
    --xai_threshold_percentile 90
```
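The sketch below illustrates what a percentile threshold on attribution scores could look like. It rests on an assumption about the meaning of `--xai_threshold_percentile`: that tokens are kept when their absolute attribution is at or above the given percentile of all scores in the document. The actual behaviour is defined by docTransformer; the tokens and scores here are invented.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def top_attributions(tokens, scores, q=90.0):
    """Keep (token, score) pairs whose |score| reaches the q-th percentile (assumed semantics)."""
    if not tokens or len(tokens) != len(scores):
        raise ValueError("tokens and scores must be non-empty and of equal length")
    cutoff = percentile([abs(s) for s in scores], q)
    return [(tok, s) for tok, s in zip(tokens, scores) if abs(s) >= cutoff]


# Toy attribution scores, invented for illustration only.
tokens = ["el", "elote", "está", "muy", "rico", "con", "chile", "y", "limón", "hoy"]
scores = [0.01, 0.90, 0.02, 0.03, 0.10, 0.01, 0.40, 0.00, 0.20, 0.05]

selected = top_attributions(tokens, scores, q=90.0)
print(selected)
```

With these toy scores the 90th-percentile cutoff falls between the two largest values, so only the most strongly attributed token is kept.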
## Citation
**BibTeX:**
```
@inproceedings{espana-bonet-barron-cedeno-2024-elote,
title = "Elote, Choclo and Mazorca: on the Varieties of {S}panish",
author = "Espa{\~n}a-Bonet, Cristina and
Barr{\'o}n-Cede{\~n}o, Alberto",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.204",
pages = "3689--3711"
}
```
**APA:**
España-Bonet, C., & Barrón-Cedeño, A. (2024, June). Elote, Choclo and Mazorca: on the Varieties of Spanish. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 3689-3711). Association for Computational Linguistics.