---
license: apache-2.0
language:
- ks
tags:
- Text
---
# Kashmir Text Generation Model
## Model Overview
This is a transformer-based model for generating Kashmiri text. It uses a decoder-only architecture with positional encoding and self-attention mechanisms.
## Try the Live Demo on Spaces
[VIEW HERE (Click)](https://huggingface.co/spaces/Omarrran/kashmiri_text_generation_trail)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/66afb3f1eaf3e876595627bf/OY88f69T0yxwODUz7iDK8.png)
## Intended Use
- **Primary Use**: Generating coherent Kashmiri text continuations from given prompts
- **Intended Users**: Researchers and developers working with Kashmiri language processing
- **Out-of-Scope Uses**: Not intended for production deployment without further evaluation
## Model Architecture
- **Type**: Decoder-only Transformer
- **Components**:
  - Positional Encoding
  - Embedding Layer
  - Transformer Decoder Layers
  - Linear Output Layer
- **Implementation**: PyTorch
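
The positional encoding used here is the standard sinusoidal scheme (sine on even embedding dimensions, cosine on odd ones), matching the `PositionalEncoding` class shown in the Usage section:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$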
## Model Details
- **Architecture:** Custom Transformer Decoder
- **Vocabulary Size:** 36100
- **Embedding Dimension:** 256
- **Number of Layers:** 4
- **Number of Attention Heads:** 8
- **Sequence Length:** 64
- **Training Data:** Kashmiri text corpus
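
The vocabulary size listed above can be cross-checked against the vocabulary file. This is a small sketch, assuming `word_to_int.json` has already been downloaded as described in the Setup section:

```python
import json

# Load the word-to-id mapping shipped with the repository
with open("word_to_int.json", "r", encoding="utf-8") as f:
    word_to_int = json.load(f)

# Should match the vocabulary size reported in Model Details (36100)
print(f"Vocabulary entries: {len(word_to_int)}")
```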
## Technical Specifications
- **Framework**: PyTorch
- **Input**: Text prompts in Kashmiri
- **Output**: Generated text continuation
- **Model Parameters**:
  - Embedding Dimension: specified in `model_config.json`
  - Number of Layers: specified in `model_config.json`
  - Number of Attention Heads: specified in `model_config.json`
  - Sequence Length: specified in `model_config.json`
  - Dropout Rate: 0.2
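
The configuration values can be inspected directly from `model_config.json`. The key names below (`vocab_size`, `embed_dim`, `num_layers`, `num_heads`, `sequence_length`) are the ones read by `load_model()` in the Usage section:

```python
import json

# Inspect the model configuration shipped with the repository
with open("model_config.json", "r") as f:
    config = json.load(f)

# These keys are the ones read by load_model() below
for key in ["vocab_size", "embed_dim", "num_layers", "num_heads", "sequence_length"]:
    print(f"{key}: {config.get(key)}")
```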
## Files Structure
```
root/
├── model.pt             # Trained model weights
├── word_to_int.json     # Word-to-integer mapping
├── int_to_word.json     # Integer-to-word mapping
└── model_config.json    # Model configuration
```
## Setup in Google Colab
1. Create a new Google Colab notebook
2. Copy and paste the following code into a code cell:
```python
!git clone https://huggingface.co/Omarrran/Kashur_gpt_version_1
```
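
Alternatively, if `git` is unavailable, the same files can be fetched with the `huggingface_hub` library. This is a minimal sketch, not part of the original setup instructions, and assumes the package is installed (`pip install huggingface_hub`):

```python
from huggingface_hub import snapshot_download

# Download all repository files into a local folder
snapshot_download(
    repo_id="Omarrran/Kashur_gpt_version_1",
    local_dir="Kashur_gpt_version_1",
)
```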
## Required Files
The model requires the following files, which are downloaded when you clone the Hugging Face repository:
- `model.pt`: The trained model weights
- `model_config.json`: Model configuration parameters
- `word_to_int.json`: Vocabulary mapping from words to integers
- `int_to_word.json`: Vocabulary mapping from integers to words
## NOTE
1. Ensure all required files are present in the root directory. In Colab, the following snippet moves the downloaded files from the cloned repository into `/content/`:
```python
import os
import shutil

# Define the source and destination paths
source_path = "/content/Kashur_gpt_version_1/"
destination_path = "/content/"

# Loop through all files in the source directory and move them to the destination
for filename in os.listdir(source_path):
    file_path = os.path.join(source_path, filename)
    if os.path.isfile(file_path):
        shutil.move(file_path, destination_path)

print(f"All files from {source_path} moved to {destination_path}")
```
## Usage
### 1. Import Required Libraries
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import json
import os
```
### 2. Define the Model and Helper Functions
```python
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def generate_square_subsequent_mask(sz):
    # Causal mask: each position may only attend to earlier positions
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Precompute the sinusoidal position table
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)


class TextGen(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, sequence_length):
        super(TextGen, self).__init__()
        self.pos_encoder = PositionalEncoding(max_len=sequence_length, d_model=embed_dim)
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.decoder_layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer=self.decoder_layer, num_layers=num_layers)
        self.linear = nn.Linear(embed_dim, vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        emb = self.emb(x)
        input_mask = generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.pos_encoder(emb)
        # Decoder-only setup: the sequence attends to itself, so it is passed as both target and memory
        x = self.decoder(x, memory=x, tgt_mask=input_mask, memory_mask=input_mask)
        x = self.dropout(x)
        out = self.linear(x)
        return out


def load_model():
    # Load configuration
    with open('model_config.json', 'r') as f:
        config = json.load(f)

    # Load vocabularies
    with open('word_to_int.json', 'r', encoding='utf-8') as f:
        word_to_int = json.load(f)
    with open('int_to_word.json', 'r', encoding='utf-8') as f:
        int_to_word = json.load(f)

    # Initialize model
    model = TextGen(
        vocab_size=config['vocab_size'],
        embed_dim=config['embed_dim'],
        num_layers=config['num_layers'],
        num_heads=config['num_heads'],
        sequence_length=config['sequence_length']
    ).to(device)

    # Load model weights
    model.load_state_dict(torch.load('model.pt', map_location=device))
    model.eval()

    return model, word_to_int, int_to_word, config['sequence_length']


@torch.no_grad()
def generate_text(model, prompt, word_to_int, int_to_word, sequence_length, max_length=100, temperature=1.0):
    model.eval()
    words = prompt.split()
    # Map prompt words to ids; unknown words fall back to id 0
    current_seq = torch.LongTensor([word_to_int.get(w, 0) for w in words]).unsqueeze(0).to(device)

    for _ in range(max_length):
        # Keep only the most recent `sequence_length` tokens as context
        if current_seq.size(1) > sequence_length:
            current_seq = current_seq[:, -sequence_length:]

        output = model(current_seq)
        next_token_logits = output[:, -1, :] / temperature
        next_token = torch.multinomial(F.softmax(next_token_logits, dim=-1), num_samples=1)

        current_seq = torch.cat([current_seq, next_token], dim=1)
        next_word = int_to_word.get(str(next_token.item()), "<UNK>")
        words.append(next_word)

        # Stop at the first sentence-ending period
        if next_word == ".":
            break

    return " ".join(words)


if __name__ == "__main__":
    # Load model and required files
    model, word_to_int, int_to_word, sequence_length = load_model()
```
### Load the Model
The model loads automatically when you run the code above, using the GPU if one is available and the CPU otherwise.
### 3. Generate Text
To generate text, use the following format:
```python
# Example prompt (in Kashmiri)
prompt = " دِتم مصمت۔یم بگُل غلام چھُ آں تس اکھ حمزہ گویی" # Replace with your Kashmiri text prompt
generated_text = generate_text(
model,
prompt,
word_to_int,
int_to_word,
sequence_length,
max_length=100 # Adjust this value to control output length
)
print(f"Generated text: {generated_text}")
```
## Generation Parameters
You can adjust the following parameters when calling `generate_text`:
- `max_length`: Maximum number of words to generate (default: 100)
- `temperature`: Controls randomness in generation (default: 1.0); see the example below
  - Higher values (>1.0) make the output more diverse
  - Lower values (<1.0) make the output more focused and deterministic
- Sequence length: Maximum context window size (specified in `model_config.json`)
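
For example, the same prompt can be sampled at two temperatures to compare output diversity. This sketch reuses the `generate_text` function defined above and assumes `model`, the vocabularies, and `prompt` are already set up as in the Usage section:

```python
# Compare a conservative and a more exploratory sampling setting
for temp in (0.7, 1.2):
    text = generate_text(
        model,
        prompt,
        word_to_int,
        int_to_word,
        sequence_length,
        max_length=50,
        temperature=temp,
    )
    print(f"temperature={temp}:\n{text}\n")
```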
## Limitations
- The model uses word-level tokenization
- Limited by the maximum sequence length specified in the configuration
- Generation stops at the first period (.) encountered
- Performance may vary based on input prompt quality and length
## Performance Considerations
- Runs on both CPU and CUDA-enabled GPUs
- Memory usage scales with sequence length and batch size
- Inference speed depends on hardware capabilities and generation parameters
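
As a rough way to gauge memory footprint, the parameter count can be computed once the model is loaded. This is a small sketch; the actual values depend on the configuration in `model_config.json`:

```python
# Count trainable parameters and estimate their size in float32
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")
print(f"Approximate size (float32): {num_params * 4 / 1024**2:.1f} MB")
```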
## Dependencies
- Python 3.6+
- PyTorch
- `math`, `json`, and `os` (Python standard library)
## License
Apache 2.0 (see the card metadata above).
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{kashmiri_text_gen,
  author       = {Haq Nawaz Malik},
  title        = {Kashmiri Text Generation Model},
  year         = {2024},
  note         = {Preprint},
  howpublished = {\url{https://huggingface.co/Omarrran/kashmiri_text_gen_model}}
}
```
## Contact
[Add contact information for model maintainers]
## Updates and Maintenance
- Version: 1.0
- Last Updated: 26-10-2024
- An updated version is in progress.