Model Description

  • Model type: Multi-class classifier on top of a transformer
  • Language(s) (NLP): Spanish varieties: Argentinian (ar), Chilean (cl), Mexican (mx), European Spanish from Spain (es), and all remaining varieties (mix)
  • License: GPL-3.0
  • Finetuned from: XLM-RoBERTa large
  • Preprocessing and tokenisation: the same as XLM-RoBERTa

We provide models for three settings: a 3-class problem (es, mx, mix), a 4-class problem (cl, es, mx, mix) and a 5-class problem (ar, cl, es, mx, mix). For each setting, we include models trained with 3 different seeds, in versions with one and with two splits of the training documents. See the docTransformer documentation for details.
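
The sketch below enumerates the model variants described above. It is only an illustration. The file name pattern is inferred from the two names that appear in the example commands of this card (C4_cereal2splits_seed1.bin and C4_cereal1split_seed1.bin). The reading of "C4" as "4 classes", the seed numbers 1 to 3, and the helper names model_filename and all_variants are assumptions, so check the actual file list before relying on it.

```python
from itertools import product

# Label sets for each setting, as listed in this model card.
LABELS = {
    3: ["es", "mx", "mix"],
    4: ["cl", "es", "mx", "mix"],
    5: ["ar", "cl", "es", "mx", "mix"],
}


def model_filename(n_classes: int, n_splits: int, seed: int) -> str:
    """Build a file name following the ASSUMED pattern C<classes>_cereal<splits>split(s)_seed<seed>.bin."""
    if n_classes not in LABELS:
        raise ValueError(f"unsupported number of classes: {n_classes}")
    if n_splits not in (1, 2):
        raise ValueError(f"unsupported number of splits: {n_splits}")
    # The card's examples use "1split" (singular) and "2splits" (plural).
    suffix = "split" if n_splits == 1 else "splits"
    return f"C{n_classes}_cereal{n_splits}{suffix}_seed{seed}.bin"


def all_variants(seeds=(1, 2, 3)):
    """List every classes x splits x seed combination (seed values are an assumption)."""
    return [
        model_filename(n_classes, n_splits, seed)
        for n_classes, n_splits, seed in product(sorted(LABELS), (1, 2), seeds)
    ]
```

Calling all_variants() under these assumptions yields one name per combination of setting, split version and seed, which can be handy when looping over models for a comparison.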

Use

Use the CEREAL classification models with docTransformer.

Example Usage

Use these models for evaluation, classification or explanation using integrated gradients:

Slurm

Evaluation (gold label available)

srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py --task evaluation -f trainedModel -o C4_cereal2splits_seed1.bin -b2 --sentence_batch_size 2 --split_documents True --test_dataset data/multivariant3all.test --plotConfusionFileName modelSplit2Seed3test.png

Classification (gold label unavailable)

srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py --task classification -f trainedModel -o C4_cereal2splits_seed1.bin -b1 --sentence_batch_size 2 --split_documents True --test_dataset ../es/es_meta_part_1.jsonl.unk

Explanation

srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py --task explanation -t data/testExample.mx -f trainedModel -o C4_cereal1split_seed1.bin -b1 --split_documents False --xai_threshold_percentile 90
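
The commands above are launched through Slurm. The sketch below shows one possible way to reuse them on a machine without Slurm by dropping the srun prefix. It assumes that docClassifier.py also runs outside a Slurm allocation, which this card does not state. The helper names strip_srun and run_locally are hypothetical. The command string is copied from the explanation example and no flags are added or changed.

```python
import shlex
import subprocess
import sys
from typing import List

# Copied verbatim from the "Explanation" example in this card.
SLURM_COMMAND = (
    "srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py "
    "--task explanation -t data/testExample.mx -f trainedModel "
    "-o C4_cereal1split_seed1.bin -b1 --split_documents False "
    "--xai_threshold_percentile 90"
)


def strip_srun(command: str) -> List[str]:
    """Drop the Slurm launcher prefix and keep the Python invocation."""
    tokens = shlex.split(command)
    start = tokens.index("python")  # first token of the actual program
    # Reuse the current interpreter instead of whatever "python" resolves to.
    return [sys.executable] + tokens[start + 1:]


def run_locally(command: str) -> int:
    """Hypothetical helper: run the command on this machine and return its exit code."""
    return subprocess.run(strip_srun(command)).returncode
```

run_locally(SLURM_COMMAND) would have to be called from the directory that contains docClassifier.py, with the model and data paths of the original command in place.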

Citation

BibTeX:

@inproceedings{espana-bonet-barron-cedeno-2024-elote,
    title = "Elote, Choclo and Mazorca: on the Varieties of {S}panish",
    author = "Espa{\~n}a-Bonet, Cristina  and
      Barr{\'o}n-Cede{\~n}o, Alberto",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.204",
    pages = "3689--3711"
}

APA:

España-Bonet, C., & Barrón-Cedeño, A. (2024, June). Elote, Choclo and Mazorca: on the Varieties of Spanish. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 3689-3711). Association for Computational Linguistics.
