license: mit
pipeline_tag: translation

Bangla-to-English-Translation

This model uses an encoder-decoder architecture with an attention mechanism and was trained to translate Bangla to English. The repository containing this model could be a useful starting point for those working on Bangla machine translation. The training dataset consists of 195,775 Bangla-English sentence pairs. Both languages were tokenized with the SentencePiece tokenizer, using a vocabulary size of 30,000.

After training on a single 24 GB GPU for 24 hours, the model achieved a BLEU score of 0.9570595683021429 on the entire dataset of 195,775 sentence pairs (the same data used for training).

Paper for the Transformer model: Attention Is All You Need. GitHub link

Dataset

We collected 195,775 Bangla-English sentence pairs. The Bangla and English sentences in each pair are separated by a tab character (\t). Data representation in the txt file:

তারা হলের প্রোভস্টের বাড়িতেও প্রবেশ করে	They enter the house of the provost of the hall 
তার ঘোষণা মতে ২ আগস্ট থেকে ক্লাস শুরুর কথা ছিল	According to his announcement, the class was to start from August 2 
তখন সকল পরীক্ষা স্থগিত ছিল	Then all the tests were suspended 
ঐ সময় ক্লাসে ছাত্রদের উপস্থিতি ছিল খুব কম	Attendance of students in the class was very low at that time 
প্রতিদিনই প্রায় কলাভবনে গ্রেনেড বিস্ফোরন হত	Grenades exploded in Kalabhavan almost every day 
রাউলিং বলেন  তাঁর সবসময়ই মনে হয় যে ডাম্বলডোর সমকামী	Rowling said he always thinks Dumbledore is gay 
তিনি গেলার্ট গ্রিন্ডেলওয়াল্ডের প্রেমে পড়েছিলেন	He fell in love with Gelart Grindelwald 
তিনি ব্রহ্মচর্য ও পুথিগত জীবনকেই বেছে নেন	He chose celibacy and bookish life 
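
As a minimal sketch of how a file in this layout could be read, the snippet below splits each line on the tab character into a (Bangla, English) pair. The function name `parse_pairs` is hypothetical and not part of the repository; skipping blank lines and lines without a tab is an assumption about how malformed rows should be handled.

```python
from typing import Iterable, List, Tuple


def parse_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each 'bangla<TAB>english' line into a (bangla, english) pair."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # assumption: ignore lines that contain no tab
        # Strip stray whitespace, e.g. the trailing space after the English side.
        pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs


sample = [
    "তখন সকল পরীক্ষা স্থগিত ছিল\tThen all the tests were suspended \n",
    "\n",
]
pairs = parse_pairs(sample)
print(pairs)
```

For a real file, the same function can be applied to an open file handle, for example `parse_pairs(open("dataset.txt", encoding="utf-8"))`.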

Available public dataset: I used the dataset provided at http://www.manythings.org/anki/ben-eng.zip . This dataset contains English-Bangla sentence pairs.

Train SentencePiece Tokenizer:

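import os
import sentencepiece as spm

text_path = "dataset.txt"
model_path = "."  # assumption: directory where the tokenizer files are written
model_prefix = os.path.join(model_path, "bn_model")  # yields bn_model.model / bn_model.vocab
vocab_size = 30000
# Train the tokenizer, reserving <sep> and <cls> as user-defined symbols
spm.SentencePieceTrainer.train(f'--input={text_path} --model_prefix={model_prefix} --user_defined_symbols=<sep>,<cls> --vocab_size={vocab_size}')
# Load the trained model
bn_sp = spm.SentencePieceProcessor()
bn_sp.load(os.path.join(model_path, 'bn_model.model'))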

Configuration

Here is the training configuration of the Transformer model and its additional parameters:

-epochs : 300
-batch_size : 150
-n_layers : 6 
-heads : 8
-d_model : 512
-dropout : 0.1
-lr : 0.0001
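
The values above can be collected into a small configuration object. This is only a sketch: the class name `TrainConfig` and its helper methods are hypothetical, and the steps-per-epoch figure assumes every sentence pair is seen once per epoch with the final partial batch kept, which the repository does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 300
    batch_size: int = 150
    n_layers: int = 6
    heads: int = 8
    d_model: int = 512
    dropout: float = 0.1
    lr: float = 0.0001

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.heads != 0:
            raise ValueError("d_model must be divisible by heads")
        return self.d_model // self.heads

    def steps_per_epoch(self, n_pairs: int) -> int:
        # Assumption: one pass over all pairs, last partial batch included.
        return math.ceil(n_pairs / self.batch_size)


cfg = TrainConfig()
print(cfg.head_dim)                 # 64
print(cfg.steps_per_epoch(195775))  # 1306
```

With these settings each attention head works on 64 dimensions, and one pass over the 195,775 pairs takes 1,306 batches of up to 150 pairs.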

Testing Total Data [195775]: BLEU Score : 0.9570595683021429
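
For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU on a 0 to 1 scale. It is not the evaluation script used for the score above: the README does not say which BLEU implementation or tokenization was used, so this version assumes whitespace tokens, one reference per hypothesis, n-grams up to 4, and no smoothing. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[List[str]], refs: List[List[str]], max_n: int = 4) -> float:
    """Unsmoothed corpus BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses that are not longer than their references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)


ref = "Then all the tests were suspended".split()
print(corpus_bleu([ref], [ref]))      # exact match
print(corpus_bleu([ref[:5]], [ref]))  # truncated hypothesis is penalized for brevity
```

An exact match scores 1.0, while a hypothesis that drops the last word keeps perfect n-gram precision but is reduced by the brevity penalty.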

Experiment Machine Configuration:

NVIDIA Driver Version: 470.161.03
CUDA Version: 11.2
NVIDIA GeForce RTX 3090: 24 GB

References

  1. https://torchtutorialstaging.z5.web.core.windows.net/beginner/translation_transformer.html
  2. https://arusl.medium.com/japanese-english-language-translation-with-transformer-using-pytorch-243738146806
  3. https://github.com/hyunwoongko/transformer
  4. https://www.kaggle.com/datasets/ari994/banglaenglishtransliteration