---
language: ta
---

# TaMillion

This is the second version of a Tamil language model trained with Google Research's [ELECTRA](https://github.com/google-research/electra).

Tokenization and pre-training Colab: https://colab.research.google.com/drive/1Pwia5HJIb6Ad4Hvbx5f-IjND-vCaJzSE?usp=sharing

V1: small model with GPU; 190,000 steps

V2 (current): base model with TPU and larger corpus; 224,000 steps
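
For orientation, here is a minimal sketch of loading the checkpoint with Hugging Face `transformers`; the repo id below is an assumption for illustration, so substitute the actual path of this upload:

```python
from transformers import AutoModel, AutoTokenizer

# Assumed repo id for illustration; substitute the actual path of this upload.
model_name = "monsoon-nlp/tamillion"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # loads the ELECTRA discriminator as an encoder

# Encode a Tamil sentence and inspect the contextual embeddings.
inputs = tokenizer("தமிழ் ஒரு செம்மொழி.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```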
## Classification

Tasks: https://www.kaggle.com/sudalairajkumar/tamil-nlp

Notebook: https://colab.research.google.com/drive/1_rW9HZb6G87-5DraxHvhPOzGmSMUc67_?usp=sharing

The model outperformed mBERT on news classification:
(Random: 16.7%, mBERT: 53.0%, TaMillion: 75.1%)

The model slightly outperformed mBERT on movie reviews:
(RMSE - mBERT: 0.657, TaMillion: 0.626)

The two models had equivalent accuracy on the Tirukkural topic task.
## Question Answering

I didn't find a Tamil-language question answering dataset, but this model could be fine-tuned to train a QA model. See Hindi and Bengali examples here: https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar
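
A minimal sketch of attaching a QA head for such fine-tuning (repo id again assumed):

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "monsoon-nlp/tamillion"  # assumed repo id for illustration

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Adds a randomly initialized span-prediction head on top of the pretrained encoder;
# it still needs fine-tuning on a QA dataset before it can answer questions.
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
```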
## Corpus

Trained on IndicCorp Tamil (11GB, https://indicnlp.ai4bharat.org/corpora/) and the 1 October 2020 dump of https://ta.wikipedia.org (482MB).
## Vocabulary

Included as vocab.txt in the upload.
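
A small sketch of building a tokenizer straight from that file, assuming it is a WordPiece vocab as produced by the standard ELECTRA/BERT tooling; `do_lower_case=False` is a guess to keep Tamil combining marks intact:

```python
from transformers import ElectraTokenizer

# Build a WordPiece tokenizer directly from the released vocab file.
# do_lower_case=False also disables accent stripping, which would otherwise
# remove Tamil combining marks during normalization.
tokenizer = ElectraTokenizer(vocab_file="vocab.txt", do_lower_case=False)
print(tokenizer.tokenize("தமிழ் மொழி"))
```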