CRLannister committed 3a2a11b (1 parent: 24ba59d): Upload README.md with huggingface_hub

Files changed (1): README.md (+50, -3)

README.md CHANGED. The previous three-line YAML front matter (`license: apache-2.0`) was replaced by the README content below.

# Neural Network-Based Language Model for Next Token Prediction

## Overview
This project is a midterm assignment focused on developing a neural network-based language model for next token prediction. The model was trained on a custom dataset covering two languages, English and Amharic. The project applies neural-network techniques to predict the next token in a sequence, demonstrating a non-transformer approach to language modeling.

## Project Objectives
The main objectives of this project were to:
- Develop a neural network-based model for next token prediction without using transformers or encoder-decoder architectures.
- Experiment with multiple languages to observe model performance.
- Implement checkpointing to save model progress and generate text at different training stages.
- Present a video demo showcasing the model's performance in generating text in both English and Amharic.

## Project Details

### 1. Training Languages
The model was trained on datasets in English and Amharic. The datasets were cleaned and prepared, including tokenization and embedding, before training.

### 2. Tokenizer
A custom tokenizer was created using Byte Pair Encoding (BPE). The tokenizer was trained on five languages (English, Amharic, Sanskrit, Nepali, and Hindi), although only English and Amharic were used for this task.
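
As a rough illustration of how such a tokenizer can be built, the sketch below trains a BPE tokenizer with the Hugging Face `tokenizers` library. The corpus file names, vocabulary size, and special tokens are assumptions for illustration, not values taken from this repository.

```python
# Minimal sketch: train a multilingual BPE tokenizer with the `tokenizers` library.
# File names, vocab size, and special tokens are placeholders (assumptions).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)

# One plain-text corpus file per language (hypothetical paths).
corpus_files = ["english.txt", "amharic.txt", "sanskrit.txt", "nepali.txt", "hindi.txt"]
tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("tokenizer.json")

# Quick check: encode a sentence and inspect the token ids.
print(tokenizer.encode("Hello world").ids)
```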

### 3. Embedding Model
A custom embedding model was employed to convert tokens into vector representations, allowing the neural network to better understand the structure and meaning of the input data.
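
For context, a token embedding in PyTorch typically amounts to a learned lookup table. The sketch below is only a minimal illustration; the vocabulary size and embedding dimension are assumed values, not the ones used in this project.

```python
# Minimal sketch: map token ids to dense vectors with a learned lookup table.
# vocab_size and embedding_dim are assumptions, not this project's actual values.
import torch
import torch.nn as nn

vocab_size = 30_000
embedding_dim = 256

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

token_ids = torch.tensor([[5, 42, 7]])   # a batch of one sequence of token ids
vectors = embedding(token_ids)           # shape: (1, 3, embedding_dim)
print(vectors.shape)
```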

### 4. Model Architecture
The project uses an LSTM (Long Short-Term Memory) neural network to predict the next token in a sequence. LSTMs are well-suited for sequential data and are a popular choice for language modeling due to their ability to capture long-term dependencies.
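
The README does not spell out the exact layer configuration here, so the following is only a plausible sketch of an LSTM next-token model in PyTorch; the layer sizes and the class name `LSTMLanguageModel` are illustrative assumptions.

```python
# Minimal sketch of an LSTM-based next-token predictor (hyperparameters are assumptions).
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=30_000, embedding_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)    # logits over the vocabulary

    def forward(self, token_ids, hidden=None):
        x = self.embedding(token_ids)                  # (batch, seq_len, embedding_dim)
        output, hidden = self.lstm(x, hidden)          # (batch, seq_len, hidden_dim)
        logits = self.fc(output)                       # (batch, seq_len, vocab_size)
        return logits, hidden

# Predict a distribution over the next token for each position in a dummy batch.
model = LSTMLanguageModel()
logits, _ = model(torch.randint(0, 30_000, (4, 16)))
print(logits.shape)  # torch.Size([4, 16, 30000])
```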

## Results and Evaluation

### Training Curve and Loss
The model's training and validation loss over time are documented in the repository (`loss_values.csv`). The training curve shows the model's learning progress, with explanations provided for key observations in the loss trends.
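
To visualize the curve from `loss_values.csv`, something like the sketch below can be used. The column names `epoch`, `train_loss`, and `val_loss` are assumptions about the CSV layout; adjust them to match the actual file header.

```python
# Minimal sketch: plot training/validation loss from loss_values.csv.
# Column names are assumed; check the CSV header and adjust if they differ.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("loss_values.csv")

plt.plot(df["epoch"], df["train_loss"], label="training loss")
plt.plot(df["epoch"], df["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("cross-entropy loss")
plt.legend()
plt.title("Training curve")
plt.show()
```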

### Checkpoint Implementation
Checkpointing was implemented to save model states at different training stages, allowing for partial model evaluations and text-generation demos. The checkpoints are included in the repository for reference.
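
A common way to implement this in PyTorch is shown below; the file-name pattern and the contents of the checkpoint dictionary are assumptions for illustration, not a description of the exact checkpoints shipped with this repository.

```python
# Minimal sketch: save and restore a training checkpoint in PyTorch.
# The checkpoint keys and file name are assumptions for illustration.
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint_epoch_{}.pt"):
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path.format(epoch),
    )

def load_checkpoint(model, optimizer, path):
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]
```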

### Perplexity Score
The model's perplexity, computed during training, is recorded in the `perplexity.csv` file. Perplexity is the exponential of the average cross-entropy loss, so lower values indicate that the model assigns higher probability to the correct next tokens.
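
As a reminder of how that number is derived, the sketch below computes perplexity from token-level cross-entropy. The tensors are dummy data, not the project's actual evaluation outputs.

```python
# Minimal sketch: perplexity = exp(mean token-level cross-entropy), on dummy data.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 16, 30_000)              # (batch, seq_len, vocab_size), dummy
targets = torch.randint(0, 30_000, (4, 16))      # true next-token ids, dummy

loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
perplexity = torch.exp(loss)
print(f"cross-entropy: {loss.item():.3f}, perplexity: {perplexity.item():.1f}")
```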

## Demonstration
A video demo, linked below, shows:
- Text generated from a randomly initialized (untrained) model in English.
- Text generated by the trained model in both English and Amharic, with English translations of the Amharic output provided via Google Translate.

**Video Demo Link:** [YouTube Demo](https://youtu.be/1m21NYmLSC4)
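
The text shown in the demo is produced by repeatedly sampling the next token from the model. The loop below is a generic sketch of that procedure; it reuses the hypothetical `LSTMLanguageModel` and tokenizer from the earlier sketches and a simple temperature-sampling rule, none of which are confirmed details of the actual demo script.

```python
# Minimal sketch: autoregressive sampling from the (hypothetical) LSTM model above.
import torch

def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
    model.eval()
    token_ids = torch.tensor([tokenizer.encode(prompt).ids])    # (1, prompt_len)
    generated = token_ids
    hidden = None
    with torch.no_grad():
        # Warm up the hidden state on the full prompt, then sample token by token.
        logits, hidden = model(token_ids, hidden)
        for _ in range(max_new_tokens):
            probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)   # (1, 1)
            generated = torch.cat([generated, next_id], dim=1)
            logits, hidden = model(next_id, hidden)             # feed only the new token
    return tokenizer.decode(generated[0].tolist())

# Example usage (assuming `model` and `tokenizer` exist):
# print(generate(model, tokenizer, "Once upon a time"))
```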

## Instructions for Reproducing the Results
1. Install the dependencies (Python, PyTorch, and the other required libraries).
2. Open the `.ipynb` notebook and run the cells sequentially to replicate training and evaluation.
3. Download the model and tokenizer files from the Hugging Face Hub, as described in the Hugging Face documentation (a download sketch follows this list).
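
One way to fetch the files programmatically is with `huggingface_hub`, as sketched below. The repository id and file names are placeholders (assumptions); substitute the actual values from this model's Hub page.

```python
# Minimal sketch: download files from the Hugging Face Hub.
# The repo id and file names below are placeholders, not the real ones.
from huggingface_hub import hf_hub_download

repo_id = "<username>/<model-repo>"   # replace with the actual repo id
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json")
checkpoint_path = hf_hub_download(repo_id=repo_id, filename="model_checkpoint.pt")

print(tokenizer_path, checkpoint_path)
```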