ancatmara committed on
Commit
b862bc9
1 Parent(s): e893505

Update README.md

  1. README.md +34 -1
README.md CHANGED
@@ -3,4 +3,37 @@ license: cc-by-nc-sa-4.0
3
  language:
4
  - ga
5
  library_name: transformers
6
- ---
6
+ ---
7
+
8
+ The **Historical Irish WordPiece tokenizer** was trained on Old, Middle, Early Modern, Classical Modern and pre-reform Modern Irish texts from the St. Gall Glosses, the Würzburg Glosses, CELT and the book subcorpus of the Historical Irish Corpus. The training data spans ca. 550 to 1926 and covers a wide variety of genres, such as bardic poetry, native Irish stories, translations and adaptations of continental epic and romance, annals, genealogies, grammatical and medical tracts, diaries, and religious writing. Because of code-switching in some texts, the vocabulary also contains some Latin.
9
+
10
+ The WordPiece tokenization algorithm is used in BERT and its derivatives.
11
+
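For readers unfamiliar with WordPiece, the following is a minimal sketch of the greedy longest-match-first segmentation that BERT-style WordPiece tokenizers apply to each word at inference time. It is an illustration only, not this tokenizer's implementation: the function name `wordpiece_tokenize`, the `##` continuation prefix, the `[UNK]` token and the toy vocabulary are all assumptions made for the example, and the real vocabulary and special tokens may differ.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word.

    `prefix` and `unk_token` follow BERT's conventions and are assumptions here.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is treated as unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary invented for illustration; not the real tokenizer's vocabulary.
toy_vocab = {"aid", "##chi", "chot", "##lud", "inn", "##a"}

print(wordpiece_tokenize("aidchi", toy_vocab))
print(wordpiece_tokenize("chotlud", toy_vocab))
```

With this toy vocabulary, a word is split into the longest known prefix followed by the longest known continuation pieces, and a word that cannot be fully covered falls back to the unknown token.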
12
+ ### Use
13
+
14
+ ```python
15
+ from transformers import AutoTokenizer
16
+
17
+ tokenizer = AutoTokenizer.from_pretrained("ancatmara/historical-irish-tokenizer-wordpiece")
18
+ texts = ['Boí Óengus in n-aidchi n-aili inna chotlud.', 'Co n-accae ní, in n-ingin cucci for crunn síuil dó.']
19
+
20
+ tokenizer(texts, max_length=128, truncation=True)
21
+ ```
22
+
23
+ Out:
24
+
25
+ ```python
26
+ {'input_ids': [[0, 905, 2526, 158, 55, 18, 2561, 55, 18, 2259, 1676, 10924, 19, 2], [0, 154, 55, 18, 4457, 106, 207, 17, 158, 55, 18, 2139, 11166, 98, 222, 7499, 20032, 148, 19, 2]],
27
+ 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
28
+ 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
29
+ ```
30
+
31
+ ```python
32
+ tokenizer.decode([0, 905, 2526, 158, 55, 18, 2561, 55, 18, 2259, 1676, 10924, 19, 2])
33
+ ```
34
+
35
+ Out:
36
+
37
+ ```python
38
+ '<s> boi oengus in n - aidchi n - aili inna chotlud. </s>'
39
+ ```
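The decoded string above differs from the input in casing and diacritics (`Boí Óengus` comes back as `boi oengus`), which suggests that the tokenizer lowercases text and strips accents before segmentation. The README does not state this, so the sketch below is an assumption about the kind of normalization involved, written with the standard library only. The helper name `lowercase_and_strip_accents` is hypothetical, and the tokenizer's actual normalizer configuration may differ.

```python
import unicodedata


def lowercase_and_strip_accents(text):
    """Lowercase text and drop combining marks, as BERT-style normalizers commonly do.

    This is an assumed approximation, not the tokenizer's own normalizer.
    """
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Category "Mn" (nonspacing mark) covers the combining acute accent and similar marks.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(lowercase_and_strip_accents("Boí Óengus in n-aidchi n-aili inna chotlud."))
```

If this assumption holds, length marks (síneadh fada) are not preserved in the token sequence, which is worth keeping in mind for tasks where vowel length is distinctive.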