---
library_name: transformers
tags:
- LLM
- Multilingual
- Transformer
- Non-English
- Tokenizer
- Indian
- Assamese
---
# Assamese Tokenizer (50K Vocabulary)
## Model Details
This repository contains a custom tokenizer for the Assamese language with a vocabulary of 50,000 tokens. It was trained on the Assamese subset of the CC-100 multilingual dataset and can be used for Natural Language Processing (NLP) tasks that involve Assamese text.
## Repository Details
- **Repository Name:** tamang0000/assamese-tokenizer-50k
- **Tokenizer Vocabulary Size:** 50,000 tokens
- **Training Dataset:** CC-100 Multilingual Dataset (Assamese Language Subset)
- **Model Type:** Tokenizer
- **Framework:** Hugging Face Transformers
- **License:** MIT License
## Tokenizer Usage
You can load and use this tokenizer with the Hugging Face `transformers` library, typically by passing the repository name listed above to `AutoTokenizer.from_pretrained`.
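
A minimal sketch of loading and using the tokenizer, assuming the repository files are in a format that `AutoTokenizer` can load. The helper name `tokenize_assamese` and the sample sentence are illustrative and not part of this repository.

```python
from transformers import AutoTokenizer

# Repository name as listed under "Repository Details".
REPO_ID = "tamang0000/assamese-tokenizer-50k"


def tokenize_assamese(text, repo_id=REPO_ID):
    """Hypothetical helper: return (tokens, ids, decoded) for `text`."""
    # Downloads the tokenizer files from the Hub on first use, then reads the local cache.
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    tokens = tokenizer.tokenize(text)   # list of token strings
    ids = tokenizer.encode(text)        # token IDs; may include special tokens if any are defined
    decoded = tokenizer.decode(ids)     # text reconstructed from the IDs
    return tokens, ids, decoded


# Example call (needs network access the first time it runs):
# tokens, ids, decoded = tokenize_assamese("অসমীয়া ভাষা")
```

Comparing `decoded` with the input is a quick way to see how the tokenizer's normalization changes the text.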
## Training Details
- **Dataset:** The tokenizer was trained exclusively on the Assamese language subset of the CC-100 multilingual dataset.
- **Vocabulary Size:** 50,000 tokens.
- **Normalization:** Includes normalization steps such as lowercasing and stripping accents.
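
To make the normalization bullet concrete, here is a sketch of what lowercasing and accent stripping usually mean, written with the Python standard library only. It is an assumption about the general technique, not the tokenizer's actual normalizer configuration, and the function name `lowercase_and_strip_accents` is hypothetical.

```python
import unicodedata


def lowercase_and_strip_accents(text):
    """Illustrative normalization: decompose, drop nonspacing marks, lowercase."""
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop characters in Unicode category Mn (nonspacing marks), i.e. the accents.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Lowercasing only affects scripts that have letter case.
    return stripped.lower()


print(lowercase_and_strip_accents("Café"))  # cafe
```

Because this filter removes every nonspacing mark, it may be worth checking its effect on Assamese text before reusing the same steps in another pipeline.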