|
|
|
|
# Pashto BERT (BERT-Base) |
|
|
|
## Model Overview |
|
This is a monolingual **Pashto BERT (BERT-Base)** model pretrained on a large **Pashto corpus**. It is an encoder-only model that produces contextual representations of **Pashto** text, making it suitable for a variety of downstream **Natural Language Processing (NLP) tasks**.
|
|
|
## Model Details |
|
- **Architecture:** BERT-Base (12 layers, 768 hidden units, 12 attention heads, ~110M parameters); see the configuration sketch after this list.
|
- **Language:** Pashto (ps) |
|
- **Training Corpus:** A diverse set of Pashto text data, including news articles, books, and web content. |
|
- **Special Tokens:** `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`, `[UNK]` |
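
For reference, these architecture settings map onto a standard `transformers` `BertConfig` as sketched below; the vocabulary size shown is an assumption and should match the released tokenizer.

```python
from transformers import BertConfig

# BERT-Base settings as listed above.
config = BertConfig(
    vocab_size=30_000,            # assumption: use the actual tokenizer vocabulary size
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,       # standard BERT-Base feed-forward size
    max_position_embeddings=512,
)
print(config)
```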
|
|
|
## Intended Use |
|
This model can be **fine-tuned** for various Pashto-specific NLP tasks, such as: |
|
- **Sequence Classification:** Sentiment analysis, topic classification, and document categorization (see the fine-tuning sketch after this list).
|
- **Sequence Tagging:** Named entity recognition (NER) and part-of-speech (POS) tagging. |
|
- **Text Understanding:** Extractive question answering; the model can also serve as the encoder component in pipelines for text summarization or machine translation.
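
As an illustration, a minimal sequence-classification fine-tuning sketch with the `transformers` `Trainer` is shown below. The repository name, dataset, and label count are placeholders, not part of this release.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "your-huggingface-username/pashto-bert-base"  # placeholder repository name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    # Truncate to the pretraining sequence length of 128 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=128)

args = TrainingArguments(
    output_dir="pashto-bert-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

# `train_ds` / `eval_ds` are assumed to be Hugging Face `datasets` splits with
# "text" and "label" columns, tokenized via `train_ds.map(tokenize, batched=True)`.
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```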
|
|
|
## How to Use |
|
This model can be loaded using the `transformers` library from Hugging Face: |
|
|
|
```python |
|
from transformers import AutoModel, AutoTokenizer

model_name = "your-huggingface-username/pashto-bert-base"

# Load the tokenizer and model from the same repository.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a Pashto sentence ("How was your day today?") and run a forward pass.
text = "ستاسو نننۍ ورځ څنګه وه؟"
tokens = tokenizer(text, return_tensors="pt")
out = model(**tokens)  # out.last_hidden_state holds the token-level embeddings
|
``` |
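
Because the model was pretrained with masked language modeling, it can also be probed directly through the `fill-mask` pipeline, assuming the published checkpoint includes the MLM head. The example sentence below ("Kabul is the [MASK] of Afghanistan.") is illustrative.

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="your-huggingface-username/pashto-bert-base",  # placeholder repository name
)

# "Kabul is the [MASK] of Afghanistan."
predictions = fill_mask("کابل د افغانستان [MASK] دی.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```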
|
|
|
## Training Details |
|
- **Optimization:** AdamW |
|
- **Sequence Length:** 128 |
|
- **Warmup Steps:** 10,000 |
|
- **Warmup Ratio:** 0.06 |
|
- **Learning Rate:** 1e-4 |
|
- **Weight Decay:** 0.01 |
|
- **Adam Optimizer Parameters:** |
|
- **Epsilon:** 1e-8 |
|
- **Betas:** (0.9, 0.999) |
|
- **Gradient Accumulation Steps:** 1 |
|
- **Max Gradient Norm:** 1.0 |
|
- **Scheduler:** Linear warmup followed by linear decay (`get_linear_schedule_with_warmup`); a matching setup sketch follows this list.
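
The listed hyperparameters roughly correspond to the following optimizer and scheduler setup; the vocabulary size and total number of training steps below are assumptions, not released values.

```python
import torch
from transformers import BertConfig, BertForMaskedLM, get_linear_schedule_with_warmup

# Assumption: a BERT-Base model pretrained from scratch with masked language modeling.
model = BertForMaskedLM(BertConfig(vocab_size=30_000))

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)

num_training_steps = 1_000_000  # assumption: depends on corpus size and batch size
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,
    num_training_steps=num_training_steps,
)

# Inside the training loop, gradients are clipped to the listed max norm:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```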
|
|
|
|
|
## Limitations & Biases |
|
- The model may reflect biases present in the training data. |
|
- Performance on **low-resource or domain-specific tasks** may require additional fine-tuning. |
|
- It is not trained for **code-switching scenarios** (e.g., mixing Pashto with English or other languages). |