---
license: apache-2.0
language:
- ks
tags:
- Text
---

# Kashmiri Text Generation Model

## Model Overview

This is a transformer-based text generation model for the Kashmiri language. It uses a decoder-only architecture with positional encoding and self-attention mechanisms.

## TRY LIVE DEMO ON SPACES

[VIEW HERE (Click)](https://huggingface.co/spaces/Omarrran/kashmiri_text_generation_trail)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66afb3f1eaf3e876595627bf/OY88f69T0yxwODUz7iDK8.png)

## Intended Use

- **Primary Use**: Generating coherent Kashmiri text continuations from given prompts
- **Intended Users**: Researchers and developers working with Kashmiri language processing
- **Out-of-Scope Uses**: Not intended for production deployment without further evaluation

## Model Architecture

- **Type**: Decoder-only Transformer
- **Components**:
  - Embedding Layer
  - Positional Encoding
  - Transformer Decoder Layers
  - Linear Output Layer
- **Implementation**: PyTorch
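
As a quick orientation, here is a minimal, self-contained sketch of how these components fit together, using the hyperparameters listed under Model Details below. Causal masking and positional encoding are omitted here; the complete `TextGen` module under Usage includes both.

```python
import torch
import torch.nn as nn

# Hyperparameters from "Model Details"
vocab_size, embed_dim, num_heads, num_layers, seq_len = 36100, 256, 8, 4, 64

emb = nn.Embedding(vocab_size, embed_dim)                      # Embedding Layer
layer = nn.TransformerDecoderLayer(embed_dim, num_heads, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=num_layers)  # Transformer Decoder Layers
head = nn.Linear(embed_dim, vocab_size)                        # Linear Output Layer

tokens = torch.randint(0, vocab_size, (1, seq_len))            # (batch=1, seq_len=64)
x = emb(tokens)                                                # (1, 64, 256)
h = decoder(x, memory=x)                                       # sequence attends over itself
print(head(h).shape)                                           # torch.Size([1, 64, 36100])
```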

## Model Details

- **Architecture:** Custom Transformer Decoder
- **Vocabulary Size:** 36,100
- **Embedding Dimension:** 256
- **Number of Layers:** 4
- **Number of Attention Heads:** 8
- **Sequence Length:** 64
- **Training Data:** Kashmiri text corpus

## Technical Specifications

- **Framework**: PyTorch
- **Input**: Text prompts in Kashmiri
- **Output**: Generated text continuation
- **Model Parameters**:
  - Embedding Dimension: Specified in `model_config.json`
  - Number of Layers: Specified in `model_config.json`
  - Number of Attention Heads: Specified in `model_config.json`
  - Sequence Length: Specified in `model_config.json`
  - Dropout Rate: 0.2 (fixed in the model code)
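
For reference, given the values under Model Details and the keys read by `load_model` below, the configuration file should look roughly like this (an illustrative sketch, not a verbatim copy of the shipped file):

```json
{
  "vocab_size": 36100,
  "embed_dim": 256,
  "num_layers": 4,
  "num_heads": 8,
  "sequence_length": 64
}
```

Note that the dropout rate is hardcoded to 0.2 in the model definition rather than read from this file.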

## Files Structure

```
root/
├── model.pt            # Trained model weights
├── word_to_int.json    # Word-to-integer mapping
├── int_to_word.json    # Integer-to-word mapping
└── model_config.json   # Model configuration
```

## Setup in Google Colab

1. Create a new Google Colab notebook.
2. Copy and paste the following code into a code cell:

```python
!git clone https://huggingface.co/Omarrran/Kashur_gpt_version_1
```
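
Alternatively, the same files can be fetched with the `huggingface_hub` client (an equivalent option, assuming the package is installed in the Colab environment):

```python
from huggingface_hub import snapshot_download

# Download the repository contents into a local folder
snapshot_download(repo_id="Omarrran/Kashur_gpt_version_1", local_dir="Kashur_gpt_version_1")
```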

## Required Files

The model requires the following files, which are downloaded from the Hugging Face repository:

- `model.pt`: the trained model weights
- `model_config.json`: model configuration parameters
- `word_to_int.json`: vocabulary mapping from words to integers
- `int_to_word.json`: vocabulary mapping from integers to words

## NOTE

Ensure all required files are present in the root of your working directory. After cloning, move them out of the repository folder:

```python
import os
import shutil

# Define the source and destination paths
source_path = "/content/Kashur_gpt_version_1/"
destination_path = "/content/"

# Loop through all files in the source directory and move them to the destination
for filename in os.listdir(source_path):
    file_path = os.path.join(source_path, filename)
    if os.path.isfile(file_path):
        shutil.move(file_path, destination_path)

print(f"All files from {source_path} moved to {destination_path}")
```
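
A quick, optional sanity check that everything ended up in place (a small sketch using only the standard library):

```python
import os

required = ["model.pt", "model_config.json", "word_to_int.json", "int_to_word.json"]
missing = [f for f in required if not os.path.isfile(f)]
print("All required files present" if not missing else f"Missing files: {missing}")
```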

## Usage

### 1. Import Required Libraries

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import json
import os
```

### 2. Define and Load the Model

```python
# Device configuration: use GPU if available, otherwise CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def generate_square_subsequent_mask(sz):
    # Causal mask: position i may only attend to positions <= i
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Fixed sinusoidal position encodings
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

class TextGen(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, sequence_length):
        super(TextGen, self).__init__()
        self.pos_encoder = PositionalEncoding(max_len=sequence_length, d_model=embed_dim)
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.decoder_layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer=self.decoder_layer, num_layers=num_layers)
        self.linear = nn.Linear(embed_dim, vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        emb = self.emb(x)
        input_mask = generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.pos_encoder(emb)
        # Decoder-only setup: the sequence serves as its own "memory",
        # with the same causal mask on self- and cross-attention
        x = self.decoder(x, memory=x, tgt_mask=input_mask, memory_mask=input_mask)
        x = self.dropout(x)
        out = self.linear(x)
        return out

def load_model():
    # Load configuration
    with open('model_config.json', 'r') as f:
        config = json.load(f)

    # Load vocabularies
    with open('word_to_int.json', 'r', encoding='utf-8') as f:
        word_to_int = json.load(f)
    with open('int_to_word.json', 'r', encoding='utf-8') as f:
        int_to_word = json.load(f)

    # Initialize model
    model = TextGen(
        vocab_size=config['vocab_size'],
        embed_dim=config['embed_dim'],
        num_layers=config['num_layers'],
        num_heads=config['num_heads'],
        sequence_length=config['sequence_length']
    ).to(device)

    # Load model weights
    model.load_state_dict(torch.load('model.pt', map_location=device))
    model.eval()

    return model, word_to_int, int_to_word, config['sequence_length']

@torch.no_grad()
def generate_text(model, prompt, word_to_int, int_to_word, sequence_length, max_length=100, temperature=1.0):
    model.eval()
    words = prompt.split()
    # Map words to ids; unknown words fall back to index 0
    current_seq = torch.LongTensor([word_to_int.get(w, 0) for w in words]).unsqueeze(0).to(device)

    for _ in range(max_length):
        # Keep only the most recent `sequence_length` tokens as context
        if current_seq.size(1) > sequence_length:
            current_seq = current_seq[:, -sequence_length:]

        output = model(current_seq)
        # Sample the next token from the temperature-scaled distribution
        next_token_logits = output[:, -1, :] / temperature
        next_token = torch.multinomial(F.softmax(next_token_logits, dim=-1), num_samples=1)

        current_seq = torch.cat([current_seq, next_token], dim=1)
        next_word = int_to_word.get(str(next_token.item()), "<UNK>")
        words.append(next_word)

        # Stop at the first sentence-ending period
        if next_word == ".":
            break

    return " ".join(words)

if __name__ == "__main__":
    # Load the model and required files
    model, word_to_int, int_to_word, sequence_length = load_model()
```

The model loads automatically after running the code above, on the GPU if one is available and on the CPU otherwise.
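
To confirm the load succeeded, a quick check along these lines can help (illustrative; the expected values follow Model Details):

```python
model, word_to_int, int_to_word, sequence_length = load_model()
print(len(word_to_int))    # vocabulary size, expected 36100
print(sequence_length)     # context window, expected 64

# Dummy forward pass: logits come back as (batch, seq_len, vocab_size)
dummy = torch.randint(0, len(word_to_int), (1, sequence_length), device=device)
print(model(dummy).shape)  # expected: torch.Size([1, 64, 36100])
```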

### 3. Generate Text

To generate text, use the following format:

```python
# Example prompt (in Kashmiri)
prompt = " دِتم مصمت۔یم بگُل غلام چھُ آں تس اکھ حمزہ گویی"  # Replace with your own Kashmiri prompt

generated_text = generate_text(
    model,
    prompt,
    word_to_int,
    int_to_word,
    sequence_length,
    max_length=100  # Adjust this value to control output length
)

print(f"Generated text: {generated_text}")
```

## Generation Parameters

You can adjust the following parameters when calling `generate_text`:

- `max_length`: Maximum number of words to generate (default: 100)
- `temperature`: Controls randomness in generation (default: 1.0)
  - Higher values (>1.0) make the output more diverse
  - Lower values (<1.0) make the output more focused and deterministic
- Sequence length: Maximum context window size (specified in the configuration)
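
For example, to compare a more focused and a more exploratory sample from the same prompt (a usage sketch; `model`, `prompt`, and the vocabularies are the objects loaded above):

```python
# Lower temperature -> more deterministic output; higher -> more varied
for t in (0.7, 1.2):
    sample = generate_text(
        model, prompt, word_to_int, int_to_word,
        sequence_length, max_length=50, temperature=t
    )
    print(f"temperature={t}: {sample}")
```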

## Limitations

- The model operates at word-level tokenization (prompts are split on whitespace)
- Context is limited by the maximum sequence length specified in the configuration
- Words not present in the vocabulary fall back to index 0 during encoding
- Generation stops at the first period (.) encountered
- Performance may vary based on input prompt quality and length

## Performance Considerations

- Runs on both CPU and CUDA-enabled GPUs
- Memory usage scales with sequence length and batch size
- Inference speed depends on hardware capabilities and generation parameters

## Dependencies

- Python 3.6+
- PyTorch

The `math`, `json`, `os`, and `shutil` modules used in the snippets above are part of the Python standard library.

## License

Apache 2.0 (see the license field in the card metadata above).

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{kashmiri_text_gen,
  author       = {Haq Nawaz Malik},
  title        = {Kashmiri Text Generation Model},
  year         = {2024},
  note         = {Preprint},
  howpublished = {\url{https://huggingface.co/Omarrran/kashmiri_text_gen_model}}
}
```

## Contact

[Add contact information for model maintainers]

## Updates and Maintenance

- Version: 1.0
- Last Updated: 26-10-2024
- An updated version is in progress