	---
	license: gpl-3.0
	language:
	- es
	metrics:
	- accuracy
	pipeline_tag: text-classification
	tags:
	- classification
	- pytorch
	- language varieties
	---
	## Model Description

	- Model type: Multi-class classifier on top of a transformer
	- Language(s) (NLP): Spanish varieties: Argentinian (ar), Chilean (cl), Mexican (mx), Spanish from Spain (es), and all remaining varieties grouped together (mix)
	- License: GPL-3.0
	- Finetuned from: XLM-RoBERTa large
	- Preprocessing and tokenisation: the same as XLM-RoBERTa

	We provide models for three settings: a 3-class (es, mx, mix), a 4-class (cl, es, mx, mix), and a 5-class problem (ar, cl, es, mx, mix). For each setting, we include models trained with 3 different seeds, each in two versions: one using one split and one using two splits of the training documents.
	See the documentation of [docTransformer](https://github.com/cristinae/docTransformer) for more detailed information.
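
The combinations described above (3 class settings, 2 split versions, 3 seeds) can be enumerated with a short sketch. The file naming pattern is an assumption inferred from the two checkpoint names that appear in the example commands of this card (`C4_cereal2splits_seed1.bin` and `C4_cereal1split_seed1.bin`), and the seed numbering 1 to 3 is also assumed; check the repository file list for the actual names.

```python
from itertools import product

# Assumed naming pattern, inferred from the two file names used in the
# example commands of this card; the real file names may differ.
N_CLASSES = (3, 4, 5)
N_SPLITS = (1, 2)
SEEDS = (1, 2, 3)  # assumption: seeds are numbered 1 to 3


def checkpoint_name(n_classes: int, n_splits: int, seed: int) -> str:
    """Build a checkpoint file name under the assumed naming pattern."""
    split_word = "split" if n_splits == 1 else "splits"
    return f"C{n_classes}_cereal{n_splits}{split_word}_seed{seed}.bin"


def all_checkpoint_names() -> list[str]:
    """List every classes x splits x seeds combination."""
    return [
        checkpoint_name(c, s, seed)
        for c, s, seed in product(N_CLASSES, N_SPLITS, SEEDS)
    ]


if __name__ == "__main__":
    for name in all_checkpoint_names():
        print(name)
```

Under these assumptions the sketch lists 18 candidate file names, one per combination.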


	## Model Sources

	- Repository: https://github.com/CEREAL-es/CEREAL
	- Paper: [Elote, Choclo and Mazorca: on the Varieties of Spanish](https://aclanthology.org/2024.naacl-long.204.pdf) (NAACL 2024)
	- Data: the corpora are available on [Zenodo](https://zenodo.org/records/11390829)

	## Use

	Use the CEREAL classification models with [docTransformer](https://github.com/cristinae/docTransformer).
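
Before running docTransformer, a checkpoint has to be available locally. One possible way to fetch it is the `huggingface_hub` library, sketched below. The repository id `cristinae/cereal` is the one this card is published under; `fetch_checkpoint` is a hypothetical helper name, and any file name passed to it must exist in the repository.

```python
REPO_ID = "cristinae/cereal"  # the Hub repository this card belongs to


def fetch_checkpoint(filename: str, repo_id: str = REPO_ID) -> str:
    """Download one file from the Hub and return its local cache path."""
    # Imported lazily so the module still loads without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)
```

For example, `fetch_checkpoint("C4_cereal2splits_seed1.bin")` would return the cached path of that file, assuming a file with that name is present in the repository. Where docTransformer expects the file to be placed is not covered here; see its documentation.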


	## Example Usage

	The models can be used for evaluation, classification, or explanation with integrated gradients:

	### Slurm

	#### Evaluation (gold label available)

	```
	srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py \
	  --task evaluation -f trainedModel -o C4_cereal2splits_seed1.bin \
	  -b2 --sentence_batch_size 2 --split_documents True \
	  --test_dataset data/multivariant3all.test \
	  --plotConfusionFileName modelSplit2Seed3test.png
	```

	#### Classification (gold label unavailable)

	```
	srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py \
	  --task classification -f trainedModel -o C4_cereal2splits_seed1.bin \
	  -b1 --sentence_batch_size 2 --split_documents True \
	  --test_dataset ../es/es_meta_part_1.jsonl.unk
	```

	#### Explanation

	```
	srun --ntasks 1 --gpus-per-task 1 python -u docClassifier.py \
	  --task explanation -t data/testExample.mx -f trainedModel \
	  -o C4_cereal1split_seed1.bin -b1 --split_documents False \
	  --xai_threshold_percentile 90
	```
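
The Slurm commands in this section all share the same `python -u docClassifier.py ...` core. A minimal sketch for assembling that core from Python, for instance to run it on a machine without Slurm, could look as follows. It uses only the flags shown in this card; `build_command` and `run` are hypothetical helper names, and running outside Slurm is an assumption that has not been verified against docTransformer.

```python
import shlex
import subprocess


def build_command(task: str, model_file: str, extra_args: list[str],
                  model_dir: str = "trainedModel") -> list[str]:
    """Assemble the docClassifier.py call as an argument list (no srun)."""
    return ["python", "-u", "docClassifier.py",
            "--task", task, "-f", model_dir, "-o", model_file, *extra_args]


def run(cmd: list[str]) -> None:
    """Print the command, then execute it; raises if it exits with an error."""
    print(shlex.join(cmd))
    subprocess.run(cmd, check=True)


# The classification example from this card, minus the Slurm launcher.
classification_cmd = build_command(
    task="classification",
    model_file="C4_cereal2splits_seed1.bin",
    extra_args=["-b1", "--sentence_batch_size", "2",
                "--split_documents", "True",
                "--test_dataset", "../es/es_meta_part_1.jsonl.unk"],
)
```

Calling `run(classification_cmd)` from a docTransformer checkout would launch the job. Keeping the arguments as a list avoids shell quoting issues with file paths.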


	## Citation

	BibTeX:
	```
	@inproceedings{espana-bonet-barron-cedeno-2024-elote,
	    title = "Elote, Choclo and Mazorca: on the Varieties of {S}panish",
	    author = "Espa{\~n}a-Bonet, Cristina and
	      Barr{\'o}n-Cede{\~n}o, Alberto",
	    editor = "Duh, Kevin and
	      Gomez, Helena and
	      Bethard, Steven",
	    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
	    month = jun,
	    year = "2024",
	    address = "Mexico City, Mexico",
	    publisher = "Association for Computational Linguistics",
	    url = "https://aclanthology.org/2024.naacl-long.204",
	    pages = "3689--3711"
	}
	```

	APA:

	España-Bonet, C., & Barrón-Cedeño, A. (2024, June). Elote, Choclo and Mazorca: on the Varieties of Spanish. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 3689-3711). Association for Computational Linguistics.