# Pashto BERT (BERT-Base)
## Model Overview
This is a monolingual **Pashto BERT (BERT-Base)** model pretrained on a large **Pashto corpus**. The model is designed to understand Pashto text, making it suitable as a base for various downstream **Natural Language Processing (NLP) tasks**.
## Model Details
- **Architecture:** BERT-Base (12 layers, 768 hidden size, 12 attention heads, 110M parameters)
- **Language:** Pashto (ps)
- **Training Corpus:** A diverse set of Pashto text data, including news articles, books, and web content.
- **Special Tokens:** `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`, `[UNK]`
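For orientation, these settings correspond to the stock BERT-Base `BertConfig` in `transformers`; this is only a sketch, and the checkpoint's own `config.json` remains the authoritative source (vocabulary size in particular is tokenizer-specific and omitted here):
```python
from transformers import BertConfig

# Sketch of the BERT-Base shape described above; the repository's config.json is authoritative.
config = BertConfig(
    num_hidden_layers=12,    # transformer encoder layers
    hidden_size=768,         # hidden-state dimensionality
    num_attention_heads=12,  # attention heads per layer
)
print(config)
```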
## Intended Use
This model can be **fine-tuned** for various Pashto-specific NLP tasks (a minimal fine-tuning sketch follows this list), such as:
- **Sequence Classification:** Sentiment analysis, topic classification, and document categorization.
- **Sequence Tagging:** Named entity recognition (NER) and part-of-speech (POS) tagging.
- **Other Tasks:** Extractive question answering, and use as the encoder component in text summarization or machine translation pipelines.
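As an illustration of the first use case, the sketch below loads the checkpoint with a freshly initialized classification head. The two example sentences and the binary label set are placeholders for a real labeled dataset:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "ijazulhaq/pashto-bert-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 is a placeholder for a binary task such as sentiment analysis.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a tiny illustrative batch and compute the classification loss,
# which is what a Trainer/optimizer loop would minimize during fine-tuning.
batch = tokenizer(["دا ډېر ښه دی.", "دا ښه نه دی."], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
outputs = model(**batch, labels=labels)
print(outputs.loss)
```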
## How to Use
This model can be loaded using the `transformers` library from Hugging Face:
```python
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and encoder from this repository on the Hugging Face Hub.
model_name = "ijazulhaq/pashto-bert-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a Pashto sentence and run a forward pass to get contextual embeddings.
text = "ستاسو نننۍ ورځ څنګه وه؟"
tokens = tokenizer(text, return_tensors="pt")
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)
```
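As a quick sanity check, the `[MASK]` token can be predicted with the `fill-mask` pipeline. This is a minimal sketch and assumes the checkpoint ships with the masked-language-modeling head from pretraining:
```python
from transformers import pipeline

# Predict the masked token in a Pashto sentence (top candidates with scores).
fill = pipeline("fill-mask", model="ijazulhaq/pashto-bert-v1")
for prediction in fill("ستاسو نننۍ ورځ څنګه [MASK]؟"):
    print(prediction["token_str"], round(prediction["score"], 3))
```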
## Training Details
- **Optimizer:** AdamW (epsilon: 1e-8, betas: (0.9, 0.999))
- **Learning Rate:** 1e-4
- **Weight Decay:** 0.01
- **Scheduler:** `linear_schedule_with_warmup`
- **Warmup Steps:** 10,000
- **Warmup Ratio:** 0.06
- **Sequence Length:** 128
- **Gradient Accumulation Steps:** 1
- **Max Gradient Norm:** 1.0
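To make the hyperparameters above concrete, here is how they map onto `transformers.TrainingArguments`. This is a sketch under assumptions, not the original training script: the output directory is hypothetical, and the data pipeline (tokenization to a maximum length of 128, data collator, `Trainer`) is omitted:
```python
from transformers import TrainingArguments

# Sequence length (128) is applied at tokenization time and is not set here.
training_args = TrainingArguments(
    output_dir="pashto-bert-pretraining",  # hypothetical output path
    learning_rate=1e-4,
    weight_decay=0.01,
    adam_epsilon=1e-8,
    adam_beta1=0.9,
    adam_beta2=0.999,
    warmup_steps=10_000,          # takes precedence over warmup_ratio when both are set
    warmup_ratio=0.06,
    gradient_accumulation_steps=1,
    max_grad_norm=1.0,
    lr_scheduler_type="linear",   # linear schedule with warmup
)
```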
## Limitations & Biases
- The model may reflect biases present in the training data.
- Performance on **low-resource or domain-specific tasks** may require additional fine-tuning.
- It is not trained for **code-switching scenarios** (e.g., mixing Pashto with English or other languages).