---
library_name: transformers
tags:
- LLM
- Multilingual
- Dual Transformer
- Non-English
- Tokenizer
- Assamese
- Indian
---

# Assamese Tokenizer (50K Vocabulary)

[![Downloads](https://img.shields.io/github/downloads/tamang0000/assamese-tokenizer-50k/total.svg)](https://github.com/tamang0000/assamese-tokenizer-50k/releases)

## Model Details

This repository contains a custom tokenizer for the Assamese language with a vocabulary of 50,000 tokens. It was trained on the Assamese subset of the CC-100 multilingual dataset and can be used for Natural Language Processing (NLP) tasks involving Assamese text.

## Repository Details

- **Repository Name:** tamang0000/assamese-tokenizer-50k
- **Tokenizer Vocabulary Size:** 50,000 tokens
- **Training Dataset:** CC-100 Multilingual Dataset (Assamese Language Subset)
- **Model Type:** Tokenizer
- **Framework:** Hugging Face Transformers
- **License:** MIT License

## Tokenizer Usage

You can load and use this tokenizer in your projects with the Hugging Face `transformers` library.

## Training Details

- **Dataset:** The tokenizer was trained exclusively on the Assamese subset of the CC-100 multilingual dataset.
- **Vocabulary Size:** 50,000 tokens.
- **Normalization:** Includes normalization steps such as lowercasing and stripping accents.
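
To make the normalization note concrete, here is a minimal sketch of what "lowercasing and stripping accents" usually means in tokenizer pipelines: decompose the text with Unicode NFD, drop non-spacing combining marks, then lowercase. This is an illustration built on the Python standard library only. The card does not document the exact normalizer configuration, so the function name `normalize_text` and the specific steps are assumptions, not the tokenizer's actual implementation.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: lowercase and strip accents, BERT-style.

    Assumption: this mirrors the common NFD + remove-combining-marks recipe;
    the tokenizer's real normalizer may differ.
    """
    # Split each character into its base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop non-spacing marks (Unicode category "Mn"), which carry the accents.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Lowercasing only affects scripts that have letter case.
    return stripped.lower()
```

For Latin text, `normalize_text("Café")` yields `"cafe"`. Assamese script has no letter case, and some of its vowel signs are encoded as combining marks, so it is worth checking how any accent-stripping step treats your own Assamese input before relying on it.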
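
As a concrete illustration of the loading step described under Tokenizer Usage, the sketch below wraps `AutoTokenizer.from_pretrained` from `transformers`. It assumes the repository name listed in this card is also the Hugging Face Hub ID and that the tokenizer files there are loadable through `AutoTokenizer`. The helper names `load_tokenizer` and `describe` are hypothetical and exist only for this example.

```python
REPO_ID = "tamang0000/assamese-tokenizer-50k"  # repository name from this card


def load_tokenizer(repo_id: str = REPO_ID):
    """Download the tokenizer (or read it from the local cache) and return it.

    Assumption: the repository is available on the Hugging Face Hub under repo_id.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def describe(tokenizer, text: str) -> dict:
    """Return the subword tokens, their vocabulary IDs, and the decoded IDs."""
    tokens = tokenizer.tokenize(text)              # text -> subword strings
    ids = tokenizer.convert_tokens_to_ids(tokens)  # subword strings -> integer IDs
    return {
        "tokens": tokens,
        "ids": ids,
        "decoded": tokenizer.decode(ids),          # integer IDs -> text
    }
```

With `transformers` installed and network access available, `describe(load_tokenizer(), "মই অসমীয়া ভাল পাওঁ")` returns the tokens and IDs for an arbitrary sample sentence. Comparing the `decoded` field with the input is a quick way to see what the normalization steps change.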