# Pashto BERT (BERT-Base)
## Model Overview
This is a monolingual **Pashto BERT (BERT-Base)** model pretrained on a large **Pashto corpus**. The model is designed to understand Pashto text, making it suitable as a base for various downstream **Natural Language Processing (NLP) tasks**.
## Model Details
- **Architecture:** BERT-Base (12 layers, 768 hidden size, 12 attention heads, 110M parameters)
- **Language:** Pashto (ps)
- **Training Corpus:** A diverse set of Pashto text data, including news articles, books, and web content.
- **Special Tokens:** `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`, `[UNK]`
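For orientation, these settings correspond to the stock BERT-Base `BertConfig` in `transformers`; this is only a sketch, and the checkpoint's own `config.json` remains the authoritative source (vocabulary size in particular is tokenizer-specific and omitted here):
```python
from transformers import BertConfig

# Sketch of the BERT-Base shape described above; the repository's config.json is authoritative.
config = BertConfig(
    num_hidden_layers=12,    # transformer encoder layers
    hidden_size=768,         # hidden-state dimensionality
    num_attention_heads=12,  # attention heads per layer
)
print(config)
```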
## Intended Use
This model can be **fine-tuned** for various Pashto-specific NLP tasks (a minimal fine-tuning sketch follows this list), such as:
- **Sequence Classification:** Sentiment analysis, topic classification, and document categorization.
- **Sequence Tagging:** Named entity recognition (NER) and part-of-speech (POS) tagging.
- **Other Tasks:** Extractive question answering, and use as the encoder component in text summarization or machine translation pipelines.
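As an illustration of the first use case, the sketch below loads the checkpoint with a freshly initialized classification head. The two example sentences and the binary label set are placeholders for a real labeled dataset:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "ijazulhaq/pashto-bert-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 is a placeholder for a binary task such as sentiment analysis.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a tiny illustrative batch and compute the classification loss,
# which is what a Trainer/optimizer loop would minimize during fine-tuning.
batch = tokenizer(["دا ډېر ښه دی.", "دا ښه نه دی."], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
outputs = model(**batch, labels=labels)
print(outputs.loss)
```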
## How to Use
This model can be loaded using the `transformers` library from Hugging Face:
```python
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and encoder from this repository on the Hugging Face Hub.
model_name = "ijazulhaq/pashto-bert-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a Pashto sentence and run a forward pass to get contextual embeddings.
text = "ستاسو نننۍ ورځ څنګه وه؟"
tokens = tokenizer(text, return_tensors="pt")
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)
```
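As a quick sanity check, the `[MASK]` token can be predicted with the `fill-mask` pipeline. This is a minimal sketch and assumes the checkpoint ships with the masked-language-modeling head from pretraining:
```python
from transformers import pipeline

# Predict the masked token in a Pashto sentence (top candidates with scores).
fill = pipeline("fill-mask", model="ijazulhaq/pashto-bert-v1")
for prediction in fill("ستاسو نننۍ ورځ څنګه [MASK]؟"):
    print(prediction["token_str"], round(prediction["score"], 3))
```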
## Training Details
- **Optimizer:** AdamW (epsilon: 1e-8, betas: (0.9, 0.999))
- **Learning Rate:** 1e-4
- **Weight Decay:** 0.01
- **Scheduler:** `linear_schedule_with_warmup`
- **Warmup Steps:** 10,000
- **Warmup Ratio:** 0.06
- **Sequence Length:** 128
- **Gradient Accumulation Steps:** 1
- **Max Gradient Norm:** 1.0
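To make the hyperparameters above concrete, here is how they map onto `transformers.TrainingArguments`. This is a sketch under assumptions, not the original training script: the output directory is hypothetical, and the data pipeline (tokenization to a maximum length of 128, data collator, `Trainer`) is omitted:
```python
from transformers import TrainingArguments

# Sequence length (128) is applied at tokenization time and is not set here.
training_args = TrainingArguments(
    output_dir="pashto-bert-pretraining",  # hypothetical output path
    learning_rate=1e-4,
    weight_decay=0.01,
    adam_epsilon=1e-8,
    adam_beta1=0.9,
    adam_beta2=0.999,
    warmup_steps=10_000,          # takes precedence over warmup_ratio when both are set
    warmup_ratio=0.06,
    gradient_accumulation_steps=1,
    max_grad_norm=1.0,
    lr_scheduler_type="linear",   # linear schedule with warmup
)
```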
## Limitations & Biases
- The model may reflect biases present in the training data.
- Performance on **low-resource or domain-specific tasks** may require additional fine-tuning.
- It is not trained for **code-switching scenarios** (e.g., mixing Pashto with English or other languages).