# Uzbek Text Classification Model (BERT)

This model is a fine-tuned BERT model for text classification in the Uzbek language. It has been trained to classify Uzbek text into various categories such as sports, politics, and technology. The model was fine-tuned using Hugging Face's `transformers` library.
## Model Details

### Model Description

- Developed by: Abdumalikov Aziz
- Model type: BERT for Sequence Classification
- Language: Uzbek
- License: Apache 2.0
- Fine-tuned from: bert-base-multilingual-cased
This model can classify Uzbek text into 15 different categories, such as "Sport" (sports), "Dunyo" (world), "Jamiyat" (society), and more.
### Model Sources

- Repository: https://huggingface.co/abdumalikov/bert-finetuned-uzbek-text-classification
## Uses

### Direct Use
The model can be used directly for text classification tasks in the Uzbek language. It is suitable for tasks like news categorization, sentiment analysis, and more.
### Downstream Use
The model can be fine-tuned further for specific tasks or used as part of a larger NLP pipeline.
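For integration into a larger pipeline, the checkpoint can also be loaded directly instead of through the `pipeline` helper. A minimal sketch, assuming the checkpoint ID shown in the "How to Get Started" section below; the input string is a placeholder:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "abdumalikov/bert-finetuned-uzbek-text-classification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# "Matn namunasi" = "a text sample" (placeholder input)
inputs = tokenizer("Matn namunasi", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to its category name.
print(model.config.id2label[logits.argmax(dim=-1).item()])
```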
### Out-of-Scope Use
The model is not suitable for generating text or for tasks outside text classification in Uzbek.
## Bias, Risks, and Limitations

### Bias
The model may have biases present in the training data, especially since the data may represent specific viewpoints or cultural contexts.
### Risks
The model may misclassify text that does not fit well into the predefined categories. Users should be cautious when applying the model to texts outside the training domain.
### Limitations
The model is limited to text classification and is not suitable for tasks such as question-answering or text generation.
## How to Get Started with the Model

Here is an example of how to use the model with Hugging Face's `pipeline`:

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="abdumalikov/bert-finetuned-uzbek-text-classification")

# "Today the Uzbekistan team became volleyball world champions."
pipe("Bugun O'zbekiston jamoasi voleybol bo'yicha jahon chempioni bo'ldi")
```
## Training Details

### Training Data
The model was trained on a custom dataset of Uzbek text, which includes various categories such as sports, politics, technology, and more. The dataset consists of approximately 512,750 labeled examples.
### Training Procedure

The model was fine-tuned using the following hyperparameters (an equivalent setup is sketched after the list):
- Epochs: 5
- Batch size: 16
- Learning rate: 2e-5
- Optimizer: AdamW
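The following is a minimal, hedged sketch of a `Trainer` setup with the hyperparameters above, not the author's actual training script. The original dataset is not released, so a tiny in-memory dataset stands in for it; all names here are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=15)

# Stand-in for the ~512,750-example labeled corpus.
data = Dataset.from_dict({"text": ["Matn namunasi"], "label": [0]})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="uzbek-bert-clf",      # placeholder output directory
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,               # Trainer uses AdamW by default
)

Trainer(model=model, args=args, train_dataset=data).train()
```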
#### Preprocessing

The data was tokenized using the BERT tokenizer, which adds the [CLS] and [SEP] special tokens to mark sequence boundaries.
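A short illustration of this step, using the base model's tokenizer on a placeholder sentence:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# "Today the Uzbekistan team won a victory" (placeholder input)
encoded = tokenizer("Bugun O'zbekiston jamoasi g'alaba qozondi")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', ..., '[SEP]']  (exact subword splits may vary)
```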
## Evaluation

### Testing Data
The model was tested on a separate dataset of 50,000 examples.
### Metrics
The following metrics were used to evaluate the model:
- Accuracy: 86%
- F1-score: 87%
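A hedged sketch of how such metrics can be computed with scikit-learn; `y_true` and `y_pred` are placeholders for the held-out labels and the model's predictions, and the card does not state which F1 averaging was used:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 2, 1, 2]  # placeholder gold labels
y_pred = [0, 2, 1, 1]  # placeholder model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
# Weighted F1 is a common choice for multi-class classification.
print("F1:", f1_score(y_true, y_pred, average="weighted"))
```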
## Model Examination
The model's predictions were examined to ensure that it performs well across all categories.
## Environmental Impact

- Hardware Type: GeForce RTX 3090
- Hours used: 5 hours
- Carbon Emitted: Approximately 10 kg CO2eq
## Technical Specifications

### Model Architecture
The model is based on the BERT architecture for sequence classification.
### Compute Infrastructure

The model was trained on a GeForce RTX 3090 GPU using the Hugging Face Trainer API.
## Citation

BibTeX:

```bibtex
@misc{abdumalikov2025uzbekbert,
  title={Uzbek Text Classification using BERT},
  author={Abdumalikov, Aziz},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/abdumalikov/bert-finetuned-uzbek-text-classification}}
}
```
## Model Card Authors

- Abdumalikov Aziz

## Model Card Contact
For questions or feedback, please contact abdumalikov via the Hugging Face platform.