Model Overview

Model Name: T5 Hebrew-to-English Translation Tokenizer
Model Type: Tokenizer for Transformer-based models
Base Model: T5 (Text-to-Text Transfer Transformer)
Preprocessing: Custom Tokenizer using SentencePieceBPETokenizer
Training Data: Custom Hebrew-English dataset curated for translation tasks
Intended Use: Machine translation, specifically Hebrew-to-English translation.

Model Description

This tokenizer was trained on a parallel Hebrew-English dataset using SentencePieceBPETokenizer from the Hugging Face tokenizers library. It is designed for tokenizing Hebrew text and can be paired with a sequence-to-sequence Transformer model, such as T5, for translation. When called through the Transformers API, it also pads and truncates input sequences.
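
To make the padding and truncation step concrete, here is a minimal pure-Python sketch of what happens to a batch of token ID sequences before they reach the model. The function name `pad_and_truncate`, the padding ID of 0, and the example IDs are illustrative assumptions, not values read from this tokenizer. A real tokenizer also reserves room for the end-of-sequence token when truncating, which this sketch ignores.

```python
from typing import List, Tuple


def pad_and_truncate(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate each sequence to max_length, then right-pad to the longest one.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    truncated = [seq[:max_length] for seq in batch]
    target = max((len(seq) for seq in truncated), default=0)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    attention_mask = [
        [1] * len(seq) + [0] * (target - len(seq)) for seq in truncated
    ]
    return input_ids, attention_mask


# Hypothetical token IDs for two sentences of different lengths.
ids, mask = pad_and_truncate([[11, 12, 13, 14, 15], [21, 22]], max_length=4)
```

With the Transformers library, the equivalent behavior is requested by calling the tokenizer with `padding=True, truncation=True, max_length=...`.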

Performance

Task: Hebrew-to-English Translation (Tokenizer only)
Dataset: A custom dataset containing parallel Hebrew-English sentences
Metrics:
- Vocabulary size: 30,000 tokens
- Tokenization accuracy: Not applicable (accuracy metrics apply to the paired translation model, not to the tokenizer itself)
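
Although accuracy does not apply here, a tokenizer can still be characterized by intrinsic measures. The sketch below computes two common ones: fertility (average tokens per whitespace-separated word) and the share of unknown tokens. No such figures are reported for this tokenizer; the function name `tokenizer_stats`, the `<unk>` default, and the stand-in `toy_tokenize` are assumptions made only so the example runs on its own.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_stats(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Dict[str, float]:
    """Compute tokens-per-word (fertility) and the share of unknown tokens."""
    n_words = n_tokens = n_unk = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        n_words += len(sentence.split())
        n_tokens += len(tokens)
        n_unk += sum(1 for t in tokens if t == unk_token)
    return {
        "fertility": n_tokens / n_words if n_words else 0.0,
        "unk_rate": n_unk / n_tokens if n_tokens else 0.0,
    }


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: splits each word into chunks of up to 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


# Two short sample sentences (Hebrew "shalom olam" and its English counterpart).
samples = ["\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD", "hello world"]
stats = tokenizer_stats(samples, toy_tokenize)
```

To measure the real tokenizer, pass `tokenizer.tokenize` and `tokenizer.unk_token` in place of the stand-ins, and use held-out Hebrew sentences rather than the two samples shown.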

Usage

How to Use the Tokenizer

Load the tokenizer with the Hugging Face Transformers library:

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)

# Example: Tokenizing a Hebrew sentence
hebrew_text = "\u05D0\u05EA\u05D4\u05D3 \u05E2\u05DC \u05D4\u05D7\u05D5\u05DE\u05E8\u05D4."
inputs = tokenizer(hebrew_text, return_tensors="pt")

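# input_ids holds integer token IDs; convert them back to see the token strings
print("Token IDs:", inputs["input_ids"])
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))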

Example Usage with a Pretrained Model

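To perform translation, pair this tokenizer with a T5 model that was trained or fine-tuned with it. The stock t5-small checkpoint used below is only a placeholder: it was trained with a different vocabulary, so its output with this tokenizer will not be a meaningful translation.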

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)
# Placeholder checkpoint: replace with a T5 model fine-tuned with this tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Hebrew text to translate
hebrew_text = "\u05EA\u05D0\u05E8 \u05D0\u05EA \u05DE\u05D1\u05E0\u05D4 \u05E9\u05DC \u05D0\u05D8\u05D5\u05DD."

# Tokenize and translate
inputs = tokenizer(hebrew_text, return_tensors="pt")
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # marks real tokens vs. padding
    max_length=100,
)

# Decode the output
english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Translation:", english_translation)

Limitations

The tokenizer itself does not perform translation; it must be paired with a translation model.
Translation quality depends on the paired model and the data it was trained on.
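
When pairing a custom tokenizer with an existing checkpoint, the model's embedding table usually has to match the tokenizer's vocabulary size before fine-tuning. The following sketch shows one way this might be done; the helper name `align_embeddings` is hypothetical, and it assumes `model` is a Transformers model (which provides `get_input_embeddings` and `resize_token_embeddings`) and `tokenizer` is a Transformers tokenizer.

```python
def align_embeddings(model, tokenizer) -> bool:
    """Resize the model's token embeddings to the tokenizer's vocabulary size.

    Returns True if a resize was performed, False if the sizes already matched.
    Assumes a Transformers model and tokenizer (see the note above).
    """
    vocab_size = len(tokenizer)  # includes any added special tokens
    current_size = model.get_input_embeddings().num_embeddings
    if current_size == vocab_size:
        return False
    model.resize_token_embeddings(vocab_size)
    return True
```

Matching sizes only makes the shapes compatible. The embeddings still carry the meaning of the original vocabulary, so the model needs fine-tuning on Hebrew-English data before its translations are usable.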

License

This tokenizer is licensed under the Apache 2.0 License. See the LICENSE file for more details.