Uzbek Text Classification Model (BERT)

This model is a fine-tuned BERT model for text classification in the Uzbek language. It has been trained to classify Uzbek text into various categories such as sports, politics, technology, etc. The model was fine-tuned using Hugging Face's transformers library.

Model Details

Model Description

  • Developed by: Abdumalikov Aziz
  • Model type: BERT for Sequence Classification
  • Language: Uzbek
  • License: Apache 2.0
  • Fine-tuned from: bert-base-multilingual-cased

This model can classify Uzbek text into 15 different categories, such as "Sport," "Dunyo," "Jamiyat," and more.

Uses

Direct Use

The model can be used directly for topic classification of Uzbek text, for example news categorization. Tasks with a different label set, such as sentiment analysis, require further fine-tuning.

Downstream Use

The model can be fine-tuned further for specific tasks or used as part of a larger NLP pipeline.
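As a rough sketch of further fine-tuning, the checkpoint could be reloaded with a new classification head. The label names below are hypothetical, and `ignore_mismatched_sizes=True` is needed only because the new label count differs from the 15 categories this model was trained on.

```python
def build_label_maps(labels):
    """Return (id2label, label2id) dictionaries for a list of label names."""
    id2label = {i: name for i, name in enumerate(labels)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def load_for_finetuning(labels, model_id="abdumalikov/bert-finetuned-uzbek-text-classification"):
    """Load the checkpoint with a freshly initialised head sized for `labels`."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # discard the old 15-way head
    )
    return tokenizer, model


# Hypothetical label set for a new downstream task (positive, negative, neutral).
NEW_LABELS = ["ijobiy", "salbiy", "neytral"]
id2label, label2id = build_label_maps(NEW_LABELS)
print(id2label)
```

The returned tokenizer and model can then be passed to a training loop of your choice; the new head starts from random weights and must be trained before use.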

Out-of-Scope Use

The model is not suitable for generating text or for tasks outside text classification in Uzbek.

Bias, Risks, and Limitations

Bias

The model may reflect biases present in the training data, which may represent specific viewpoints or cultural contexts.

Risks

The model may misclassify text that does not fit well into the predefined categories. Users should be cautious when applying the model to texts outside the training domain.
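One way to act on this caution is to route low-confidence predictions to manual review. The sketch below assumes predictions in the `{"label", "score"}` shape returned by a text-classification pipeline; the helper name and the 0.5 threshold are illustrative choices, not values from the author.

```python
def flag_uncertain(predictions, threshold=0.5):
    """Split pipeline-style predictions into confident and uncertain lists.

    `predictions` is a list of dicts with "label" and "score" keys.
    Each returned item is (input index, label, score).
    """
    confident, uncertain = [], []
    for index, pred in enumerate(predictions):
        target = confident if pred["score"] >= threshold else uncertain
        target.append((index, pred["label"], pred["score"]))
    return confident, uncertain


# Made-up scores for illustration only; they are not real model outputs.
sample = [
    {"label": "Sport", "score": 0.97},
    {"label": "Jamiyat", "score": 0.41},
]
confident, uncertain = flag_uncertain(sample, threshold=0.5)
print("confident:", confident)
print("needs review:", uncertain)
```

A suitable threshold depends on the application and is best chosen on held-out data.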

Limitations

The model is limited to text classification and is not suitable for tasks such as question-answering or text generation.

How to Get Started with the Model

Here is an example of how to use the model with Hugging Face's pipeline:

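from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
pipe = pipeline("text-classification", model="abdumalikov/bert-finetuned-uzbek-text-classification")

# Classify one Uzbek sentence; the result is a list holding the predicted label and its score
result = pipe("Bugun o'zbekiston jamoasi volleybol bo'yicha jaxon chempioni bo'ldi")
print(result)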
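For more than one input, the pipeline also accepts a list of strings. The following is a minimal sketch with hypothetical helper names (`load_classifier`, `classify_batch`), assuming the pipeline returns one `{"label", "score"}` dict per input text.

```python
MODEL_ID = "abdumalikov/bert-finetuned-uzbek-text-classification"


def load_classifier(model_id=MODEL_ID):
    """Create the text-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # lazy import; needs transformers installed

    return pipeline("text-classification", model=model_id)


def classify_batch(classifier, texts):
    """Run `classifier` on a list of texts and pair each text with its top label."""
    predictions = classifier(texts)  # one {"label", "score"} dict per input text
    return [
        (text, pred["label"], round(pred["score"], 3))
        for text, pred in zip(texts, predictions)
    ]


# Usage (requires network access to download the model):
# clf = load_classifier()
# for text, label, score in classify_batch(clf, ["first text", "second text"]):
#     print(label, score, text)
```

Keeping the classifier as a parameter makes the helper easy to exercise with a stand-in function before the real model is downloaded.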

Training Details

Training Data

The model was trained on a custom dataset of Uzbek text, which includes various categories such as sports, politics, technology, and more. The dataset consists of approximately 512,750 labeled examples.

Training Procedure

The model was fine-tuned using the following hyperparameters:

  • Epochs: 5
  • Batch size: 16
  • Learning rate: 2e-5
  • Optimizer: AdamW
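To give a sense of scale, the hyperparameters above can be turned into a step count and a `TrainingArguments` object. This is a sketch under stated assumptions: a single GPU, no gradient accumulation, all of the roughly 512,750 examples used for training, and a hypothetical output directory. The author's actual training script is not shown in this card.

```python
import math

# Values stated in this model card.
NUM_EXAMPLES = 512_750
EPOCHS = 5
BATCH_SIZE = 16
LEARNING_RATE = 2e-5

# Assumption: one GPU, no gradient accumulation, every example used for training.
steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
print(f"{steps_per_epoch} steps per epoch, {total_steps} optimizer steps in total")


def make_training_arguments(output_dir="uzbek-bert-out"):
    """Build TrainingArguments with the listed hyperparameters."""
    from transformers import TrainingArguments  # lazy import; needs transformers installed

    return TrainingArguments(
        output_dir=output_dir,  # hypothetical path
        num_train_epochs=EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        learning_rate=LEARNING_RATE,
        # The Trainer uses an AdamW optimizer by default, so no extra argument is set.
    )
```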

Preprocessing

The data was tokenized using the BERT tokenizer, which adds the special tokens [CLS] and [SEP] at the start and end of each input sequence.
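The special-token layout BERT uses for a single input sequence can be illustrated with a toy sketch. Whitespace splitting stands in for the real WordPiece tokenizer here, and truncating to BERT's 512-token limit is an assumption about the preprocessing, not something this card states.

```python
def add_special_tokens(tokens, max_length=512):
    """Wrap a token list as [CLS] ... [SEP], truncating to `max_length` in total."""
    body = tokens[: max_length - 2]  # leave room for the two special tokens
    return ["[CLS]"] + body + ["[SEP]"]


# Whitespace splitting is only a stand-in: the real tokenizer would split
# rare words into sub-word pieces.
tokens = "Bugun futbol o'yini bo'ldi".split()
print(add_special_tokens(tokens))
```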

Evaluation

Testing Data

The model was tested on a separate dataset of 50,000 examples.

Metrics

The model achieved the following scores on the test set:

  • Accuracy: 86%
  • F1-score: 87%
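For reference, both metrics can be computed from true and predicted labels as sketched below. The card does not say how the F1-score was averaged across the 15 categories, so macro averaging is an assumption here, and the tiny label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Tiny made-up example, not the model's real predictions.
y_true = ["Sport", "Sport", "Dunyo", "Jamiyat"]
y_pred = ["Sport", "Dunyo", "Dunyo", "Jamiyat"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

With imbalanced categories, macro and weighted averages can differ noticeably, so the averaging method matters when comparing against the reported figure.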

Model Examination

The model's predictions were examined to ensure that it performs well across all categories.

Environmental Impact

  • Hardware type: GeForce RTX 3090
  • Hours used: 5
  • Carbon emitted: approximately 10 kg CO2eq

Technical Specifications

Model Architecture

The model is based on the BERT architecture for sequence classification.

Compute Infrastructure

The model was trained on a GeForce RTX 3090 GPU using the Hugging Face Trainer API.

Citation

BibTeX:

@misc{abdumalikov2025uzbekbert,
  title={Uzbek Text Classification using BERT},
  author={Abdumalikov, Aziz},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/abdumalikov/bert-finetuned-uzbek}}
}

Model Card Authors

  • Abdumalikov Aziz

Model Card Contact

For questions or feedback, please contact abdumalikov via the Hugging Face platform.
