---
license: mit
base_model: xlm-roberta-base
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
metrics:
- f1
---

# xlmr-multilingual-sentence-segmentation

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on a corrupted version of the Universal Dependencies [3] datasets.
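
As a quick start, here is a minimal usage sketch with the 🤗 Transformers token-classification pipeline. It is an illustration, not the authors' reference code: the hub id `igorsterner/xlmr-multilingual-sentence-segmentation` and the assumption that sentence-final tokens carry a non-`"O"` label are guesses, so check this repository's `config.json` (`id2label`) before relying on them.

```python
from transformers import pipeline

# Hypothetical hub id; replace with this model's actual repository id.
segmenter = pipeline(
    "token-classification",
    model="igorsterner/xlmr-multilingual-sentence-segmentation",
    aggregation_strategy="none",
)

text = "this is one sentence here is another with no punctuation at all"
preds = segmenter(text)

# Assumption: any non-"O" label marks a sentence-final token; cut the raw
# text after each predicted boundary's character span.
sentences, start = [], 0
for p in preds:
    if p["entity"] != "O":
        sentences.append(text[start : p["end"]].strip())
        start = p["end"]
if start < len(text):
    sentences.append(text[start:].strip())
print(sentences)
```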

The model achieves the following results on the (also corrupted) evaluation set:
- Loss: 0.0074
- Precision: 0.9664
- Recall: 0.9677
- F1: 0.9670
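
As a quick consistency check, the reported F1 is the harmonic mean of the precision and recall above:

```latex
F_1 = \frac{2PR}{P + R} = \frac{2 \cdot 0.9664 \cdot 0.9677}{0.9664 + 0.9677} \approx 0.9670
```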

# Test set performance

All results below are F1 scores, reported as percentages.

## Opus100 [2]

Who wins most? XLM-RoBERTa: 56, WtPSplit [1]: 12, Spacy (multilingual): 8

| | af | am | ar | az | be | bg | bn | ca | cs | cy | da | de | el | en | eo | es | et | eu | fa | fi | fr | fy | ga | gd | gl | gu | ha | he | hi | hu | hy | id | is | it | ja | ka | kk | km | kn | ko | ku | ky | lt | lv | mg | mk | ml | mn | mr | ms | my | ne | nl | pa | pl | ps | pt | ro | ru | si | sk | sl | sq | sr | sv | ta | te | th | tr | uk | ur | uz | vi | xh | yi | zh |
|:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
| Spacy (multilingual) | 42.61 | 6.69 | 58.52 | 73.59 | 34.78 | 93.74 | 38.04 | 88.76 | 87.70 | 26.30 | 90.52 | 74.15 | 89.75 | 89.25 | 88.77 | 90.95 | 87.26 | 81.20 | 55.40 | 93.28 | 85.77 | 21.49 | 60.61 | 36.83 | 88.77 | 5.59 | **89.39** | **92.21** | 53.33 | 93.26 | 24.14 | 90.13 | **95.38** | 86.32 | 0.20 | 38.24 | 42.39 | 0.10 | 9.66 | 51.79 | 27.64 | 21.77 | 76.91 | 77.02 | 83.60 | **93.74** | 39.09 | 33.23 | 86.56 | 87.39 | 0.10 | 6.59 | **93.65** | 5.26 | 92.42 | 2.41 | 92.07 | 91.63 | 75.95 | 75.91 | 92.13 | 93.00 | **92.96** | **95.01** | 93.52 | 36.97 | 64.59 | 21.64 | **94.05** | 89.68 | 29.17 | 64.99 | 90.59 | 64.89 | 4.14 | 0.09 |
| WtPSplit | 76.90 | **59.08** | 68.08 | 76.42 | 71.29 | 93.97 | 79.76 | 89.79 | 89.36 | 73.21 | 90.02 | 80.74 | 92.80 | 91.91 | 92.24 | 92.11 | 84.47 | 87.24 | 59.97 | 91.96 | 88.53 | 65.84 | 79.49 | 83.33 | 90.31 | **70.51** | 82.43 | 90.58 | 66.70 | 93.00 | 87.14 | 89.80 | 94.77 | 87.43 | **41.79** | **91.26** | 73.25 | **69.54** | 68.98 | 56.21 | **79.12** | 83.94 | 81.33 | 82.70 | **89.33** | 92.87 | 80.81 | 73.26 | 89.20 | 88.51 | **65.54** | **71.33** | 92.63 | 64.11 | 92.72 | **62.84** | 91.05 | 90.91 | 84.23 | 80.32 | 92.30 | 92.19 | 90.32 | 94.76 | 92.08 | 63.48 | 76.49 | 68.88 | 93.30 | 89.60 | 52.59 | **77.79** | 91.29 | 80.28 | **75.70** | 71.64 |
| XLM-RoBERTa (ours) | **83.97** | 41.59 | **81.56** | **81.30** | **85.68** | **94.34** | **84.10** | **91.80** | **91.23** | **78.72** | **92.64** | **86.73** | **93.87** | **94.50** | **94.57** | **93.18** | **90.19** | **90.28** | **74.79** | **94.06** | **90.46** | **81.76** | **84.33** | **85.62** | **92.55** | 67.26 | 86.61 | 91.22 | **72.69** | **94.53** | **89.83** | **92.24** | 93.78 | **89.27** | 41.43 | 78.39 | **89.15** | 36.60 | **70.51** | **82.77** | 58.14 | **89.41** | **89.99** | **88.25** | 86.82 | 92.81 | **86.14** | **94.73** | **93.25** | **92.44** | 49.39 | 66.02 | 93.60 | **69.22** | **93.51** | 61.86 | **92.84** | **93.19** | **89.47** | **86.24** | **92.95** | **93.46** | 91.79 | 94.16 | **93.93** | **72.74** | **81.77** | **74.49** | 93.17 | **92.15** | **62.92** | 75.65 | **93.41** | **84.89** | 56.85 | **77.07** |

## Universal Dependencies [3]

Who wins most? XLM-RoBERTa: 24, WtPSplit: 17, Spacy (multilingual): 13

| | af | ar | be | bg | bn | ca | cs | cy | da | de | el | en | es | et | eu | fa | fi | fr | ga | gd | gl | he | hi | hu | hy | id | is | it | ja | jv | kk | ko | la | lt | lv | mr | nl | pl | pt | ro | ru | sk | sl | sq | sr | sv | ta | th | tr | uk | ur | vi | zh |
|:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:-----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
| Spacy (multilingual) | **98.47** | 80.38 | 80.27 | 93.62 | 51.85 | **98.95** | 89.68 | 98.89 | 94.96 | 88.02 | 94.16 | 92.20 | **98.70** | 93.77 | 95.79 | **99.83** | 92.88 | 96.33 | **96.67** | 63.04 | 92.37 | 94.37 | 0.32 | **98.45** | 11.39 | 98.01 | **95.41** | 92.49 | 0.37 | 98.03 | 96.21 | **99.80** | 0.09 | 93.86 | **98.52** | 92.13 | 92.86 | 97.02 | 94.91 | **98.05** | 84.31 | 90.26 | **98.23** | **100.00** | 97.84 | 94.91 | 66.67 | 1.95 | **97.63** | 94.16 | 0.37 | 96.40 | 0.40 |
| WtPSplit | 98.27 | **83.00** | 89.28 | **98.16** | **99.12** | 98.52 | 92.98 | **99.26** | 94.56 | 96.13 | **96.94** | 94.73 | 97.60 | 94.09 | 97.24 | 97.29 | 94.69 | **96.71** | 86.60 | 72.17 | **98.87** | 95.79 | 96.78 | 96.08 | **96.80** | **98.41** | 86.39 | 95.45 | **95.84** | **98.18** | 96.28 | 99.11 | 91.43 | **97.67** | 96.42 | 91.84 | 93.61 | 95.92 | **96.13** | 81.50 | 86.28 | 95.57 | 96.85 | 99.17 | **98.45** | **95.86** | **97.54** | 70.26 | 96.00 | 92.08 | 93.79 | 92.97 | **97.25** |
| XLM-RoBERTa (ours) | 96.81 | 78.99 | **91.60** | 97.89 | **99.12** | 95.99 | **96.05** | 97.17 | **96.62** | **96.29** | 94.33 | **94.76** | 95.73 | **96.20** | **97.37** | 97.49 | **96.34** | 95.70 | 89.78 | **84.20** | 95.72 | **95.95** | **97.51** | 96.24 | 95.62 | 97.22 | 92.93 | **96.88** | 94.23 | 96.29 | **98.40** | 97.46 | **96.35** | 95.82 | 96.91 | **95.92** | **96.27** | **97.24** | 95.83 | 94.63 | **91.59** | **95.88** | 96.43 | 98.36 | 96.83 | 94.95 | 95.93 | **89.26** | 96.52 | **94.59** | **96.20** | **97.31** | 95.12 |

## Ersatz [4]

Who wins most? XLM-RoBERTa: 10, WtPSplit: 8, Spacy (multilingual): 4

| | ar | cs | de | en | es | et | fi | fr | gu | hi | ja | kk | km | lt | lv | pl | ps | ro | ru | ta | tr | zh |
|:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
| Spacy (multilingual) | **91.26** | 96.46 | 93.89 | 94.40 | 97.31 | **97.15** | 94.99 | 96.43 | 4.44 | 18.41 | 0.18 | 97.11 | 0.08 | 93.53 | **98.73** | 93.69 | **94.44** | 94.87 | 93.45 | 68.65 | 95.39 | 0.10 |
| WtPSplit | 89.45 | 93.41 | 95.93 | **97.16** | **98.74** | 95.84 | 97.10 | **97.61** | 90.62 | 94.87 | **82.14** | 95.94 | **82.89** | **96.74** | 97.22 | 95.16 | 86.99 | **97.55** | **97.82** | 94.76 | 93.53 | 89.02 |
| XLM-RoBERTa (ours) | 79.78 | **96.94** | **97.02** | 96.10 | 97.06 | 96.80 | **97.67** | 96.33 | **93.73** | **95.34** | 77.54 | **97.28** | 78.94 | 96.13 | 96.45 | **96.71** | 92.33 | 96.24 | 97.15 | **95.94** | **95.76** | **90.11** |

## German–English code-switching [5]

| | de–en |
|:---------------------|:----------|
| Spacy (multilingual) | 79.55 |
| WtPSplit | 77.41 |
| XLM-RoBERTa (ours) | **85.78** |
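
For readers reimplementing the comparison, here is a hedged sketch of a boundary-level F1 of the kind reported in the tables above. It assumes the positive class is "this character offset ends a sentence", matching the precision/recall framing earlier in this card; it is an assumption, not the authors' evaluation code.

```python
def boundary_f1(pred_ends: set[int], gold_ends: set[int]) -> float:
    """Percentage F1 over predicted vs. gold sentence-end positions (assumed metric)."""
    if not pred_ends or not gold_ends:
        return 0.0
    tp = len(pred_ends & gold_ends)  # boundaries both prediction and gold agree on
    precision = tp / len(pred_ends)
    recall = tp / len(gold_ends)
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)

# Toy example: one of two predicted boundaries matches gold -> 50.0
print(boundary_f1({17, 38}, {17, 40}))
```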

[1] [Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation](https://aclanthology.org/2023.acl-long.398) (Minixhofer et al., ACL 2023)

[2] [Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation](https://aclanthology.org/2020.acl-main.148) (Zhang et al., ACL 2020)

[3] [Universal Dependencies](https://aclanthology.org/2021.cl-2.11) (de Marneffe et al., CL 2021)

[4] [A unified approach to sentence segmentation of punctuated text in many languages](https://aclanthology.org/2021.acl-long.309) (Wicks & Post, ACL-IJCNLP 2021)

[5] [The Denglisch Corpus of German-English Code-Switching](https://aclanthology.org/2023.sigtyp-1.5) (Osmelak & Wintner, SIGTYP 2023)

### Training hyperparameters