Update README.md
README.md
CHANGED
@@ -10,19 +10,22 @@ metrics:
 - perplexity
 pipeline_tag: fill-mask
 widget:
-- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ
+- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።
   example_title: Example 1
-- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር
+- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር <mask> ግዢ በእጅጉ ጨምሯል።
   example_title: Example 2
-- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ
+- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ <mask> ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
   example_title: Example 3
-- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል
+- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል <mask> እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
   example_title: Example 4
 ---
 
 # roberta-base-amharic
 
 This model has the same architecture as [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) and was pretrained from scratch using the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, on a total of **290 Million tokens**. The tokenizer was trained from scratch on the same text corpus, and had a vocabulary size of 32k.
+
+The model was trained for **22 hours** on an **A100 40GB GPU**.
+
 It achieves the following results on the evaluation set:
 
 - `Loss: 2.09`
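The metadata above lists perplexity as a metric, but the card only reports the evaluation loss. Assuming that loss is the usual mean per-token cross-entropy, the corresponding perplexity is simply its exponential, roughly exp(2.09) ≈ 8.1; a quick sketch of that conversion:

```python
import math

# Assumption: the reported evaluation loss is the mean per-token
# cross-entropy, so perplexity is its exponential.
eval_loss = 2.09
perplexity = math.exp(eval_loss)
print(f"perplexity ≈ {perplexity:.2f}")  # ≈ 8.08
```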
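For reference, a minimal sketch of querying the model the same way the updated widget examples do, via the `transformers` fill-mask pipeline. The repo id `rasyosef/roberta-base-amharic` is an assumption based on the card title and the linked dataset namespace, not something stated in this diff; substitute the actual model id.

```python
from transformers import pipeline

# Assumed repo id (not stated in this diff); replace with the actual model id.
fill_mask = pipeline("fill-mask", model="rasyosef/roberta-base-amharic")

# First widget example from the updated YAML front matter.
text = "ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።"
for prediction in fill_mask(text, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each prediction is the pipeline's standard dictionary (token, token string, score, and filled sequence), so the loop prints the top five candidate fillers for the `<mask>` position.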