bsc-temu committed
Commit 38ea58d
1 Parent(s): a10112e

Update README.md

Files changed (1): README.md +64 -8
README.md CHANGED
@@ -4,13 +4,13 @@ language:
   - ca
- license:
+ license: ???
  tags:
  - "catalan"
- - "part of speech"
+ - "part of speech tagging"
  - "pos"
@@ -20,20 +20,76 @@ tags:
  datasets:
- - "projecte-aina/???"
+ - "universal_dependencies"
  metrics:
- - "???"
+ - f1
+
+ model-index:
+ - name: roberta-base-ca-cased-pos
+   results:
+   - task:
+       type: token-classification
+     dataset:
+       type: universal_dependencies
+       name: ancora-ca-pos
+     metrics:
+     - type: f1
+       value: 0.9893832385244624
  widget:
- - text: "El gat menja peix."
- - text: "Plou i fa sol, les bruixes es pentinen."
- - text: "El tren pinxo de Banyoles és el més bonic que hi ha."
+ - text: "Em dic Lluïsa i visc a Santa Maria del Camí."
+ - text: "L'Aina, la Berta i la Norma són molt amigues."
+ - text: "El Martí llegeix el Cavall Fort."
  ---

- # Catalan RoBERTa-base trained on Catalan Textual Corpus fine-tuned for Part-of-speech tagging.
+ # Catalan BERTa (RoBERTa-base) fine-tuned for Part-of-speech tagging (POS)
+
+ The **roberta-base-ca-cased-pos** is a Part-of-speech tagging (POS) model for the Catalan language, fine-tuned from the [BERTa](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-sized corpus collected from publicly available corpora and crawlers (check the BERTa model card for more details).
+
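+ As a quick illustration (not part of the original card), the model can be loaded with a `transformers` token-classification pipeline; the Hub repo id below is assumed from the model name and may differ:
+
+ ```python
+ from transformers import pipeline
+
+ # Assumed Hub repo id, built from the model name above; adjust to the actual path.
+ model_id = "projecte-aina/roberta-base-ca-cased-pos"
+
+ # Token-classification pipeline: every input token receives a POS label.
+ pos_tagger = pipeline("token-classification", model=model_id)
+
+ print(pos_tagger("Em dic Lluïsa i visc a Santa Maria del Camí."))
+ ```
+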
+ ## Datasets
+ We used the Catalan POS dataset from the [Universal Dependencies Treebank](https://huggingface.co/datasets/universal_dependencies), which we refer to as _Ancora-ca-pos_, for training and evaluation.
+
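+ For illustration only, the data could be loaded with the `datasets` library; the `ca_ancora` configuration and the `tokens`/`upos` column names are assumptions about how the Catalan AnCora treebank is exposed in `universal_dependencies`:
+
+ ```python
+ from datasets import load_dataset
+
+ # "ca_ancora" is assumed to be the Catalan AnCora configuration of universal_dependencies.
+ ancora = load_dataset("universal_dependencies", "ca_ancora")
+
+ # Each example is expected to carry the tokens and their Universal POS tag ids.
+ example = ancora["train"][0]
+ print(list(zip(example["tokens"], example["upos"])))
+ ```
+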
+ ## Evaluation and results
+ We evaluated the _roberta-base-ca-cased-pos_ on the Ancora-ca-pos test set against standard multilingual and monolingual baselines:
+
+ | Model                     | Ancora-ca-pos (F1) |
+ | ------------------------- | :----------------- |
+ | roberta-base-ca-cased-pos | **98.93**          |
+ | mBERT                     | 98.82              |
+ | XLM-RoBERTa               | 98.89              |
+ | WikiBERT-ca               | 97.60              |
+
+ For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/berta).
+
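+ Purely as an illustration of the metric (the averaging shown here is an assumption; the exact procedure lives in the evaluation scripts linked above), a token-level weighted F1 of this kind can be computed with scikit-learn:
+
+ ```python
+ from sklearn.metrics import f1_score
+
+ # Toy gold and predicted POS tags for a handful of tokens (illustrative only).
+ gold = ["DET", "NOUN", "VERB", "NOUN", "PUNCT"]
+ pred = ["DET", "NOUN", "VERB", "ADJ", "PUNCT"]
+
+ # Weighted token-level F1, a common choice for POS tagging evaluation.
+ print(f1_score(gold, pred, average="weighted"))
+ ```
+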
+ ## Citing
+ If you use any of these resources (datasets or models) in your work, please cite our latest paper:
+ ```bibtex
+ @inproceedings{armengol-estape-etal-2021-multilingual,
+     title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
+     author = "Armengol-Estap{\'e}, Jordi and
+       Carrino, Casimiro Pio and
+       Rodriguez-Penagos, Carlos and
+       de Gibert Bonet, Ona and
+       Armentano-Oller, Carme and
+       Gonzalez-Agirre, Aitor and
+       Melero, Maite and
+       Villegas, Marta",
+     booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
+     month = aug,
+     year = "2021",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2021.findings-acl.437",
+     doi = "10.18653/v1/2021.findings-acl.437",
+     pages = "4933--4946",
+ }
+ ```
+ ## Funding
+ TODO
+ ## Disclaimer
+ TODO