---
library_name: transformers
tags:
- LLM
- Multilingual
- Dual Transformer
- Non-English
- Tokenizer
- Assamese
- Indian
---

# Assamese Tokenizer (50K Vocabulary)

[![Downloads](https://img.shields.io/github/downloads/tamang0000/assamese-tokenizer-50k/total.svg)](https://github.com/tamang0000/assamese-tokenizer-50k/releases)

## Model Details

This repository contains a custom tokenizer for the Assamese language with a 50,000-token vocabulary, trained on the Assamese subset of the CC-100 multilingual dataset. It can be used for Natural Language Processing (NLP) tasks that involve Assamese text.

## Repository Details

- **Repository Name:** tamang0000/assamese-tokenizer-50k
- **Tokenizer Vocabulary Size:** 50,000 tokens
- **Training Dataset:** CC-100 Multilingual Dataset (Assamese Language Subset)
- **Model Type:** Tokenizer
- **Framework:** Hugging Face Transformers
- **License:** MIT License

## Tokenizer Usage

You can load and use this tokenizer in your projects with the Hugging Face `transformers` library.
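
A minimal sketch of loading and applying the tokenizer is shown here. It assumes the repository hosts standard tokenizer files that `AutoTokenizer` can read. The helper names `load_tokenizer`, `tokenize_text`, and `main`, and the sample sentence, are illustrative and not part of this repository.

```python
REPO_ID = "tamang0000/assamese-tokenizer-50k"


def load_tokenizer(repo_id: str = REPO_ID):
    """Download (or read from the local cache) the tokenizer for repo_id."""
    # Imported here so the helpers below can be used without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def tokenize_text(tokenizer, text: str):
    """Return the token strings and the token ids for text."""
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.encode(text)
    return tokens, ids


def main() -> None:
    tokenizer = load_tokenizer()
    sample = "অসম এখন সুন্দৰ ৰাজ্য।"  # illustrative Assamese sentence
    tokens, ids = tokenize_text(tokenizer, sample)
    print("Tokens:", tokens)
    print("IDs:", ids)
    print("Decoded:", tokenizer.decode(ids, skip_special_tokens=True))
```

Calling `main()` fetches the tokenizer files from the Hub on first use, so it needs network access and an installed `transformers` package.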

## Training Details

- **Dataset:** The tokenizer was trained exclusively on the Assamese language subset of the CC-100 multilingual dataset.
- **Vocabulary Size:** 50,000 tokens.
- **Normalization:** Applies steps such as lowercasing and accent stripping.
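
To make the normalization bullet concrete, here is a rough sketch of what lowercasing and accent stripping commonly mean, written with the Python standard library only. It is an assumption about the general technique, not this tokenizer's actual normalizer, and the function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase text, then drop nonspacing combining marks.

    Assumption: accent stripping is modeled as NFD decomposition followed by
    removal of characters in Unicode category "Mn". The tokenizer's real
    normalizer may be configured differently.
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize("Héllo Wörld"))  # hello world
```

Category `Mn` also covers some signs of the Bengali-Assamese script, such as the virama (U+09CD), so it is worth inspecting the tokenizer's output on Assamese text rather than relying on this sketch.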