
Greek Tokenizer

A tokenizer trained from scratch with the BPE algorithm on a Greek corpus.

Usage:

To use this tokenizer, you can load it from the Hugging Face Hub:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

Example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Tokenize input text
input_text = "Αυτό είναι ένα παράδειγμα."
inputs = tokenizer(input_text, return_tensors="pt")

# Print the tokenized input (IDs and tokens)
print("Token IDs:", inputs["input_ids"].tolist())

# Convert token IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)

# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)

It can also serve as a head start for pretraining a GPT-2 base model on the Greek language.
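For instance, a GPT-2 model could be initialized from scratch with this tokenizer's vocabulary before pretraining on a Greek corpus (a minimal sketch; apart from vocab_size and pad_token_id, the configuration uses the library defaults, which are illustrative here):

from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Build a GPT-2 config whose vocabulary matches the tokenizer
config = GPT2Config(
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
)

# Randomly initialized GPT-2 model, ready for pretraining on Greek text
model = GPT2LMHeadModel(config)
print("Model parameters:", model.num_parameters())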

Training Details

Vocabulary Size: 52000

Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
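Both values can be checked directly on the loaded tokenizer (a minimal sketch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

print("Vocabulary size:", tokenizer.vocab_size)        # expected: 52000
print("Special tokens:", tokenizer.all_special_tokens)  # [PAD], [UNK], [CLS], [SEP], [MASK]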

Benefits and why to use it:

Many generic tokenizers split words into multiple tokens.

This tokenizer is more efficient: it only splits words it was not trained on.

In the example above, the output of this tokenizer is only five tokens, while another tokenizer, e.g. Llama-3, produces 9 or more tokens.

This can have an impact on inference costs and downstream applications.
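A quick way to check this for your own text is to count the tokens produced by this tokenizer against those of a generic one (a minimal sketch; the comparison model ID is illustrative and, in the case of Llama-3, requires accepting its own access terms):

from transformers import AutoTokenizer

text = "Αυτό είναι ένα παράδειγμα."

greek_tok = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
# Any generic tokenizer can be substituted here for comparison
other_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print("Greek tokenizer:", len(greek_tok.tokenize(text)), "tokens")
print("Generic tokenizer:", len(other_tok.tokenize(text)), "tokens")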

Update

July 2024:

A new version is imminent that further decreases fertility (average tokens per word) for Greek and English.
