---
license: apache-2.0
language:
- ks
tags:
- Text
---
# Kashmir Text Generation Model
## Model Overview
This is a transformer-based model for generating Kashmiri text. It uses a decoder-only architecture with positional encoding and self-attention mechanisms.
## Try the Live Demo on Spaces
[VIEW HERE (Click)](https://huggingface.co/spaces/Omarrran/kashmiri_text_generation_trail)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/66afb3f1eaf3e876595627bf/OY88f69T0yxwODUz7iDK8.png)
## Intended Use
- **Primary Use**: Generating coherent Kashmiri text continuations from given prompts
- **Intended Users**: Researchers and developers working with Kashmiri language processing
- **Out-of-Scope Uses**: Not intended for production deployment without further evaluation
## Model Architecture
- **Type**: Decoder-only Transformer
- **Components**:
  - Positional Encoding
  - Embedding Layer
  - Transformer Decoder Layers
  - Linear Output Layer
- **Implementation**: PyTorch
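
The positional encoding used here is the standard sinusoidal scheme (sine on even embedding dimensions, cosine on odd ones), matching the `PositionalEncoding` class shown in the Usage section:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$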
## Model Details
- **Architecture:** Custom Transformer Decoder
- **Vocabulary Size:** 36100
- **Embedding Dimension:** 256
- **Number of Layers:** 4
- **Number of Attention Heads:** 8
- **Sequence Length:** 64
- **Training Data:** Kashmiri text corpus
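
The vocabulary size listed above can be cross-checked against the vocabulary file. This is a small sketch, assuming `word_to_int.json` has already been downloaded as described in the Setup section:

```python
import json

# Load the word-to-id mapping shipped with the repository
with open("word_to_int.json", "r", encoding="utf-8") as f:
    word_to_int = json.load(f)

# Should match the vocabulary size reported in Model Details (36100)
print(f"Vocabulary entries: {len(word_to_int)}")
```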
## Technical Specifications
- **Framework**: PyTorch
- **Input**: Text prompts in Kashmiri
- **Output**: Generated text continuation
- **Model Parameters**:
  - Embedding Dimension: specified in `model_config.json`
  - Number of Layers: specified in `model_config.json`
  - Number of Attention Heads: specified in `model_config.json`
  - Sequence Length: specified in `model_config.json`
  - Dropout Rate: 0.2
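
The configuration values can be inspected directly from `model_config.json`. The key names below (`vocab_size`, `embed_dim`, `num_layers`, `num_heads`, `sequence_length`) are the ones read by `load_model()` in the Usage section:

```python
import json

# Inspect the model configuration shipped with the repository
with open("model_config.json", "r") as f:
    config = json.load(f)

# These keys are the ones read by load_model() below
for key in ["vocab_size", "embed_dim", "num_layers", "num_heads", "sequence_length"]:
    print(f"{key}: {config.get(key)}")
```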
## Files Structure
```
root/
├── model.pt             # Trained model weights
├── word_to_int.json     # Word-to-integer mapping
├── int_to_word.json     # Integer-to-word mapping
└── model_config.json    # Model configuration
```
## Setup in Google Colab
1. Create a new Google Colab notebook
2. Copy and paste the following code into a code cell:
```python
!git clone https://huggingface.co/Omarrran/Kashur_gpt_version_1
```
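
Alternatively, if `git` is unavailable, the same files can be fetched with the `huggingface_hub` library. This is a minimal sketch, not part of the original setup instructions, and assumes the package is installed (`pip install huggingface_hub`):

```python
from huggingface_hub import snapshot_download

# Download all repository files into a local folder
snapshot_download(
    repo_id="Omarrran/Kashur_gpt_version_1",
    local_dir="Kashur_gpt_version_1",
)
```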
## Required Files
The model requires the following files, which are downloaded when you clone the Hugging Face repository:
- `model.pt`: The trained model weights
- `model_config.json`: Model configuration parameters
- `word_to_int.json`: Vocabulary mapping from words to integers
- `int_to_word.json`: Vocabulary mapping from integers to words
## NOTE
1. Ensure all required files are present in the root directory. In Colab, the following snippet moves the downloaded files from the cloned repository into `/content/`:
```python
import os
import shutil

# Define the source and destination paths
source_path = "/content/Kashur_gpt_version_1/"
destination_path = "/content/"

# Loop through all files in the source directory and move them to the destination
for filename in os.listdir(source_path):
    file_path = os.path.join(source_path, filename)
    if os.path.isfile(file_path):
        shutil.move(file_path, destination_path)

print(f"All files from {source_path} moved to {destination_path}")
```
## Usage
### 1. Import Required Libraries
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import json
import os
```
### 2. Define the Model and Helper Functions
```python
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def generate_square_subsequent_mask(sz):
    # Causal mask: each position may only attend to earlier positions
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Precompute the sinusoidal position table
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)


class TextGen(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, sequence_length):
        super(TextGen, self).__init__()
        self.pos_encoder = PositionalEncoding(max_len=sequence_length, d_model=embed_dim)
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.decoder_layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer=self.decoder_layer, num_layers=num_layers)
        self.linear = nn.Linear(embed_dim, vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        emb = self.emb(x)
        input_mask = generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.pos_encoder(emb)
        # Decoder-only setup: the sequence attends to itself, so it is passed as both target and memory
        x = self.decoder(x, memory=x, tgt_mask=input_mask, memory_mask=input_mask)
        x = self.dropout(x)
        out = self.linear(x)
        return out


def load_model():
    # Load configuration
    with open('model_config.json', 'r') as f:
        config = json.load(f)

    # Load vocabularies
    with open('word_to_int.json', 'r', encoding='utf-8') as f:
        word_to_int = json.load(f)
    with open('int_to_word.json', 'r', encoding='utf-8') as f:
        int_to_word = json.load(f)

    # Initialize model
    model = TextGen(
        vocab_size=config['vocab_size'],
        embed_dim=config['embed_dim'],
        num_layers=config['num_layers'],
        num_heads=config['num_heads'],
        sequence_length=config['sequence_length']
    ).to(device)

    # Load model weights
    model.load_state_dict(torch.load('model.pt', map_location=device))
    model.eval()

    return model, word_to_int, int_to_word, config['sequence_length']


@torch.no_grad()
def generate_text(model, prompt, word_to_int, int_to_word, sequence_length, max_length=100, temperature=1.0):
    model.eval()
    words = prompt.split()
    # Map prompt words to ids; unknown words fall back to id 0
    current_seq = torch.LongTensor([word_to_int.get(w, 0) for w in words]).unsqueeze(0).to(device)

    for _ in range(max_length):
        # Keep only the most recent `sequence_length` tokens as context
        if current_seq.size(1) > sequence_length:
            current_seq = current_seq[:, -sequence_length:]

        output = model(current_seq)
        next_token_logits = output[:, -1, :] / temperature
        next_token = torch.multinomial(F.softmax(next_token_logits, dim=-1), num_samples=1)

        current_seq = torch.cat([current_seq, next_token], dim=1)
        next_word = int_to_word.get(str(next_token.item()), "<UNK>")
        words.append(next_word)

        # Stop at the first sentence-ending period
        if next_word == ".":
            break

    return " ".join(words)


if __name__ == "__main__":
    # Load model and required files
    model, word_to_int, int_to_word, sequence_length = load_model()
```
### Load the Model
The model loads automatically when you run the code above, using the GPU if one is available and the CPU otherwise.
### 3. Generate Text
To generate text, use the following format:
```python
# Example prompt (in Kashmiri)
prompt = " دِتم مصمت۔یم بگُل غلام چھُ آں تس اکھ حمزہ گویی" # Replace with your Kashmiri text prompt
generated_text = generate_text(
model,
prompt,
word_to_int,
int_to_word,
sequence_length,
max_length=100 # Adjust this value to control output length
)
print(f"Generated text: {generated_text}")
```
## Generation Parameters
You can adjust the following parameters when calling `generate_text`:
- `max_length`: Maximum number of words to generate (default: 100)
- `temperature`: Controls randomness in generation (default: 1.0); see the example below
  - Higher values (>1.0) make the output more diverse
  - Lower values (<1.0) make the output more focused and deterministic
- Sequence length: Maximum context window size (specified in `model_config.json`)
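
For example, the same prompt can be sampled at two temperatures to compare output diversity. This sketch reuses the `generate_text` function defined above and assumes `model`, the vocabularies, and `prompt` are already set up as in the Usage section:

```python
# Compare a conservative and a more exploratory sampling setting
for temp in (0.7, 1.2):
    text = generate_text(
        model,
        prompt,
        word_to_int,
        int_to_word,
        sequence_length,
        max_length=50,
        temperature=temp,
    )
    print(f"temperature={temp}:\n{text}\n")
```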
## Limitations
- The model uses word-level tokenization
- Limited by the maximum sequence length specified in the configuration
- Generation stops at the first period (.) encountered
- Performance may vary based on input prompt quality and length
## Performance Considerations
- Runs on both CPU and CUDA-enabled GPUs
- Memory usage scales with sequence length and batch size
- Inference speed depends on hardware capabilities and generation parameters
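
As a rough way to gauge memory footprint, the parameter count can be computed once the model is loaded. This is a small sketch; the actual values depend on the configuration in `model_config.json`:

```python
# Count trainable parameters and estimate their size in float32
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")
print(f"Approximate size (float32): {num_params * 4 / 1024**2:.1f} MB")
```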
## Dependencies
- Python 3.6+
- PyTorch
- `math`, `json`, and `os` (Python standard library)
## License
Apache 2.0 (see the card metadata above).
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{kashmiri_text_gen,
  author       = {Haq Nawaz Malik},
  title        = {Kashmiri Text Generation Model},
  year         = {2024},
  note         = {Preprint},
  howpublished = {\url{https://huggingface.co/Omarrran/kashmiri_text_gen_model}}
}
```
## Contact
[Add contact information for model maintainers]
## Updates and Maintenance
- Version: 1.0
- Last Updated: 26-10-2024
- An updated version is in progress.