---
library_name: transformers
tags:
- LLM
- Multilingual
- Transformer
- Non-English
- Tokenizer
- Indian
- Assamese
---

# Assamese Tokenizer (50K Vocabulary)

## Model Details

This repository contains a custom tokenizer for the Assamese language with a vocabulary of 50,000 tokens. It was trained on the Assamese subset of the CC-100 multilingual dataset and can be used for Natural Language Processing (NLP) tasks involving Assamese text.

## Repository Details

- **Repository Name:** tamang0000/assamese-tokenizer-50k
- **Tokenizer Vocabulary Size:** 50,000 tokens
- **Training Dataset:** CC-100 Multilingual Dataset (Assamese Language Subset)
- **Model Type:** Tokenizer
- **Framework:** Hugging Face Transformers
- **License:** MIT License

## Tokenizer Usage

You can load and use this tokenizer in your projects with the Hugging Face `transformers` library.
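A minimal sketch of loading and using the tokenizer follows. It assumes the repository can be loaded through `AutoTokenizer.from_pretrained`, which is the usual route for tokenizers hosted on the Hub but is not confirmed by this card. The helper names `load_tokenizer`, `roundtrip`, and `main`, and the sample sentence, are illustrative only.

```python
# Sketch only: assumes the repository loads through AutoTokenizer.
REPO_ID = "tamang0000/assamese-tokenizer-50k"  # repository name from this card


def load_tokenizer(repo_id=REPO_ID):
    """Fetch the tokenizer from the Hugging Face Hub (or the local cache)."""
    # Imported lazily so roundtrip() can be reused without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def roundtrip(tokenizer, text):
    """Return (tokens, ids, decoded) for one piece of text."""
    ids = tokenizer.encode(text)                    # text -> token ids
    tokens = tokenizer.convert_ids_to_tokens(ids)   # ids -> token strings
    decoded = tokenizer.decode(ids, skip_special_tokens=True)  # ids -> text
    return tokens, ids, decoded


def main():
    tokenizer = load_tokenizer()
    # Hypothetical sample text meaning "Assamese language".
    tokens, ids, decoded = roundtrip(tokenizer, "অসমীয়া ভাষা")
    print(tokens)
    print(ids)
    print(decoded)


# main()  # uncomment to run; needs network access and `pip install transformers`
```

Comparing `decoded` with the input is a quick way to see how much the tokenizer's normalization changes Assamese text before you rely on it in a pipeline.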

## Training Details

- **Dataset:** The tokenizer was trained exclusively on the Assamese language subset of the CC-100 multilingual dataset.
- **Vocabulary Size:** 50,000 tokens.
- **Normalization:** Includes normalization steps such as lowercasing and stripping accents.
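To illustrate what lowercasing and accent stripping typically mean, here is a standard-library sketch of the common BERT-style convention: lowercase, decompose with NFD, then drop nonspacing marks. This card does not state the exact normalizer configuration, so treat the `normalize` function below as an assumption about the general technique, not as this tokenizer's actual implementation.

```python
import unicodedata


def normalize(text):
    """Lowercase, then strip accents by dropping nonspacing marks (category Mn)."""
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize("Café CRÈME"))  # prints "cafe creme"
```

Some signs used in Assamese script belong to the same Unicode category as Latin accents, so accent stripping of this kind may alter Assamese text as well. It is worth checking the tokenizer's output on real sentences before assuming the round trip is lossless.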