tejagowda committed on
Commit 35a8689
1 Parent(s): 543e30e

Update README.md

Files changed (1): README.md +86 -1
README.md CHANGED
@@ -6,4 +6,89 @@ pipeline_tag: text-classification
  tags:
  - transformer
  - tokenizer
- ---
+ language:
+ - he
+ - en
+ ---
+
+ # Model Overview
+
+ - **Model Name:** T5 Hebrew-to-English Translation Tokenizer
+ - **Model Type:** Tokenizer for Transformer-based models
+ - **Base Model:** T5 (Text-to-Text Transfer Transformer)
+ - **Preprocessing:** Custom tokenizer trained with `SentencePieceBPETokenizer`
+ - **Training Data:** Custom Hebrew-English parallel dataset curated for translation tasks
+ - **Intended Use:** Machine translation, specifically Hebrew-to-English translation
+
+ ## Model Description
+
+ This tokenizer was trained on a Hebrew-English parallel corpus using `SentencePieceBPETokenizer`. It is optimized for Hebrew text and can be paired with a Transformer model, such as T5, for sequence-to-sequence translation. It handles preprocessing steps such as tokenization, padding, and truncation.
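+
+ For reference, a tokenizer of this kind can be trained with the Hugging Face `tokenizers` library. The sketch below is illustrative only: the corpus file name and the exact special tokens are assumptions, not the original training setup.
+
+ ```python
+ from tokenizers import SentencePieceBPETokenizer
+
+ # Minimal training sketch; "hebrew_english_parallel.txt" is a hypothetical corpus file.
+ tokenizer = SentencePieceBPETokenizer()
+ tokenizer.train(
+     files=["hebrew_english_parallel.txt"],
+     vocab_size=30000,  # matches the vocabulary size reported under Performance
+     special_tokens=["<pad>", "</s>", "<unk>"],  # assumed T5-style special tokens
+ )
+
+ # Serialize so it can be loaded later (e.g., wrapped in a transformers tokenizer)
+ tokenizer.save("tokenizer.json")
+ ```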
+
+ ## Performance
+
+ - **Task:** Hebrew-to-English translation (tokenizer only)
+ - **Dataset:** A custom dataset of parallel Hebrew-English sentences
+ - **Metrics:**
+   - Vocabulary size: 30,000 tokens
+   - Tokenization accuracy: not applicable (accuracy is a metric for models, not for a standalone tokenizer)
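+
+ The reported vocabulary size can be verified directly once the tokenizer is loaded; a quick sanity check:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)
+ print("Vocab size:", tokenizer.vocab_size)  # expected to report 30,000
+ ```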
+
+ ## Usage
+
+ ### How to Use the Tokenizer
+
+ To use this tokenizer, load it with the Hugging Face Transformers library:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)
+
+ # Example: tokenizing a Hebrew sentence
+ hebrew_text = "אתהד על החומרה."
+ inputs = tokenizer(hebrew_text, return_tensors="pt")
+
+ print("Tokens:", inputs["input_ids"])
+ ```
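+
+ Continuing from the snippet above, the ids can be mapped back to subword strings to see how the text was segmented; `convert_ids_to_tokens` and `decode` are standard `transformers` tokenizer methods:
+
+ ```python
+ # Map ids back to subword strings to inspect the segmentation
+ print("Subwords:", tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
+
+ # Round-trip: decode the ids back into text
+ print("Decoded:", tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True))
+ ```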
+
+ ### Example Usage with a Pretrained Model
+
+ To perform translation, pair this tokenizer with a T5 model, ideally one fine-tuned with this tokenizer:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ # Load the tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)
+ model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # Replace with a fine-tuned model if available
+
+ # Hebrew text to translate: "Describe the structure of an atom."
+ hebrew_text = "תאר את מבנה של אטום."
+
+ # Tokenize and translate
+ inputs = tokenizer(hebrew_text, return_tensors="pt")
+ outputs = model.generate(inputs["input_ids"], max_length=100)
+
+ # Decode the output
+ english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ print("Translation:", english_translation)
+ ```
+
+ ## Limitations
+
+ - The tokenizer itself does not perform translation; it must be paired with a translation model.
+ - Its vocabulary differs from the stock T5 vocabulary, so an off-the-shelf checkpoint such as `t5-small` will not produce meaningful translations out of the box; use a model fine-tuned with this tokenizer (see the sketch after this list).
+ - Performance depends on the quality of the paired model and its training data.
+
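+ Before fine-tuning, one common way to align an off-the-shelf checkpoint with a custom tokenizer is to resize its embedding table. A minimal sketch; the checkpoint name here is only a placeholder:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)
+ model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # placeholder checkpoint
+
+ # The embedding table must cover every id this tokenizer can emit
+ if len(tokenizer) != model.config.vocab_size:
+     model.resize_token_embeddings(len(tokenizer))
+     # The model now accepts this tokenizer's ids, but it still needs
+     # fine-tuning before its outputs are meaningful.
+ ```
+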
+ ## License
+
+ This tokenizer is licensed under the Apache 2.0 License. See the LICENSE file for more details.
+