# Uzbek Text Classification Model (BERT)

This model is a fine-tuned BERT model for text classification in the Uzbek language. It has been trained to classify Uzbek text into various categories such as sports, politics, and technology. The model was fine-tuned using Hugging Face's `transformers` library.
## Model Details

### Model Description

- Developed by: Abdumalikov Aziz
- Model type: BERT for Sequence Classification
- Language: Uzbek
- License: Apache 2.0
- Fine-tuned from: bert-base-multilingual-cased
This model can classify Uzbek text into 15 different categories, such as "Sport" (sports), "Dunyo" (world), "Jamiyat" (society), and more.
### Model Sources

- Repository: https://huggingface.co/abdumalikov/bert-finetuned-uzbek-text-classification
## Uses

### Direct Use
The model can be used directly for text classification tasks in the Uzbek language. It is suitable for tasks like news categorization, sentiment analysis, and more.
### Downstream Use
The model can be fine-tuned further for specific tasks or used as part of a larger NLP pipeline.
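For integration into a larger pipeline, the checkpoint can also be loaded directly instead of through the `pipeline` helper. A minimal sketch, assuming the checkpoint ID shown in the "How to Get Started" section below; the input string is a placeholder:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "abdumalikov/bert-finetuned-uzbek-text-classification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# "Matn namunasi" = "a text sample" (placeholder input)
inputs = tokenizer("Matn namunasi", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to its category name.
print(model.config.id2label[logits.argmax(dim=-1).item()])
```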
### Out-of-Scope Use
The model is not suitable for generating text or for tasks outside text classification in Uzbek.
## Bias, Risks, and Limitations

### Bias
The model may have biases present in the training data, especially since the data may represent specific viewpoints or cultural contexts.
### Risks
The model may misclassify text that does not fit well into the predefined categories. Users should be cautious when applying the model to texts outside the training domain.
### Limitations
The model is limited to text classification and is not suitable for tasks such as question-answering or text generation.
## How to Get Started with the Model

Here is an example of how to use the model with Hugging Face's `pipeline`:

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="abdumalikov/bert-finetuned-uzbek-text-classification")

# "Today the Uzbekistan team became volleyball world champions."
pipe("Bugun O'zbekiston jamoasi voleybol bo'yicha jahon chempioni bo'ldi")
```
## Training Details

### Training Data
The model was trained on a custom dataset of Uzbek text, which includes various categories such as sports, politics, technology, and more. The dataset consists of approximately 512,750 labeled examples.
### Training Procedure

The model was fine-tuned using the following hyperparameters (an equivalent setup is sketched after the list):
- Epochs: 5
- Batch size: 16
- Learning rate: 2e-5
- Optimizer: AdamW
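The following is a minimal, hedged sketch of a `Trainer` setup with the hyperparameters above, not the author's actual training script. The original dataset is not released, so a tiny in-memory dataset stands in for it; all names here are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=15)

# Stand-in for the ~512,750-example labeled corpus.
data = Dataset.from_dict({"text": ["Matn namunasi"], "label": [0]})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="uzbek-bert-clf",      # placeholder output directory
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,               # Trainer uses AdamW by default
)

Trainer(model=model, args=args, train_dataset=data).train()
```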
#### Preprocessing

The data was tokenized using the BERT tokenizer, which adds the [CLS] and [SEP] special tokens to mark sequence boundaries.
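A short illustration of this step, using the base model's tokenizer on a placeholder sentence:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# "Today the Uzbekistan team won a victory" (placeholder input)
encoded = tokenizer("Bugun O'zbekiston jamoasi g'alaba qozondi")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', ..., '[SEP]']  (exact subword splits may vary)
```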
## Evaluation

### Testing Data
The model was tested on a separate dataset of 50,000 examples.
### Metrics
The following metrics were used to evaluate the model:
- Accuracy: 86%
- F1-score: 87%
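A hedged sketch of how such metrics can be computed with scikit-learn; `y_true` and `y_pred` are placeholders for the held-out labels and the model's predictions, and the card does not state which F1 averaging was used:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 2, 1, 2]  # placeholder gold labels
y_pred = [0, 2, 1, 1]  # placeholder model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
# Weighted F1 is a common choice for multi-class classification.
print("F1:", f1_score(y_true, y_pred, average="weighted"))
```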
## Model Examination
The model's predictions were examined to ensure that it performs well across all categories.
## Environmental Impact

- Hardware Type: GeForce RTX 3090
- Hours used: 5 hours
- Carbon Emitted: Approximately 10 kg CO2eq
## Technical Specifications

### Model Architecture
The model is based on the BERT architecture for sequence classification.
### Compute Infrastructure

The model was trained on a GeForce RTX 3090 GPU using the Hugging Face Trainer API.
## Citation

BibTeX:

```bibtex
@misc{abdumalikov2025uzbekbert,
  title={Uzbek Text Classification using BERT},
  author={Abdumalikov, Aziz},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/abdumalikov/bert-finetuned-uzbek-text-classification}}
}
```
## Model Card Authors

- Abdumalikov Aziz

## Model Card Contact
For questions or feedback, please contact abdumalikov via the Hugging Face platform.