CRLannister committed 3a2a11b (parent: 24ba59d): Upload README.md with huggingface_hub

README.md (changed):
# Neural Network-Based Language Model for Next Token Prediction

## Overview

This project is a midterm assignment focused on developing a neural network-based language model for next token prediction. The model was trained on custom datasets in two languages, English and Amharic. It applies neural network techniques to predict the next token in a sequence, demonstrating a non-transformer approach to language modeling.
## Project Objectives

The main objectives of this project were to:

- Develop a neural network-based model for next token prediction without using transformers or encoder-decoder architectures.
- Experiment with multiple languages to observe how the model performs across them.
- Implement checkpointing to save model progress and generate text at different training stages.
- Present a video demo showcasing the model's performance in generating text in both English and Amharic.
## Project Details

### 1. Training Languages

The model was trained on datasets in English and Amharic. The datasets were cleaned and prepared for training, which included tokenization and embedding.
### 2. Tokenizer

A custom tokenizer was created using Byte Pair Encoding (BPE). The tokenizer was trained on five languages (English, Amharic, Sanskrit, Nepali, and Hindi), but the model itself used only English and Amharic for this task.
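As a rough illustration, a comparable BPE tokenizer can be trained with the Hugging Face `tokenizers` library; the corpus file names, vocabulary size, and special tokens below are assumptions, not values taken from this project.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical corpus files; substitute the actual text files for each language.
corpus_files = ["english.txt", "amharic.txt", "sanskrit.txt", "nepali.txt", "hindi.txt"]

# A BPE model with an unknown-token fallback, splitting on whitespace/punctuation first.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Vocabulary size and special tokens are assumed values, not the project's settings.
trainer = BpeTrainer(vocab_size=16000, special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("tokenizer.json")
```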
### 3. Embedding Model

A custom embedding model was employed to convert tokens into vector representations, allowing the neural network to better understand the structure and meaning of the input data.
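In PyTorch terms, such an embedding layer maps token ids to dense vectors; the sizes below are placeholders rather than the project's actual dimensions.

```python
import torch
import torch.nn as nn

# Assumed sizes; the project's real vocabulary and embedding dimensions may differ.
vocab_size, embed_dim = 16000, 256
embedding = nn.Embedding(vocab_size, embed_dim)

# A batch of token ids (batch=2, sequence length=5) becomes a (2, 5, 256) tensor of vectors.
token_ids = torch.randint(0, vocab_size, (2, 5))
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([2, 5, 256])
```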
### 4. Model Architecture

The project uses an LSTM (Long Short-Term Memory) neural network to predict the next token in a sequence. LSTMs are well-suited for sequential data and are a popular choice for language modeling due to their ability to capture long-term dependencies.
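A minimal sketch of this kind of architecture in PyTorch is given below; the layer sizes, depth, and dropout are assumptions, not the project's exact configuration.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Embedding -> LSTM -> linear head producing next-token logits over the vocabulary."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=0.2)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, hidden=None):
        # token_ids: (batch, seq_len) -> logits: (batch, seq_len, vocab_size)
        embedded = self.embedding(token_ids)
        output, hidden = self.lstm(embedded, hidden)
        return self.head(output), hidden

# Example forward pass with assumed sizes.
model = LSTMLanguageModel(vocab_size=16000)
logits, _ = model(torch.randint(0, 16000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 16000])
```

During training, the logits at each position are compared against the actual next token in the sequence using cross-entropy loss.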
## Results and Evaluation

### Training Curve and Loss

The model's training and validation losses over time are documented in the repository (`loss_values.csv`). The training curve demonstrates the model's learning progress, with explanations provided for key observations in the loss trends.
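If you want to re-plot the curve, a pandas/matplotlib snippet along these lines should work; the column names (`epoch`, `train_loss`, `val_loss`) are guesses and may need to be adjusted to match the actual CSV header.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Column names are assumed; inspect loss_values.csv and adjust as needed.
losses = pd.read_csv("loss_values.csv")

plt.plot(losses["epoch"], losses["train_loss"], label="training loss")
plt.plot(losses["epoch"], losses["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.savefig("training_curve.png")
```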
### Checkpoint Implementation

Checkpointing was implemented to save model states at different training stages, allowing for partial model evaluations and text generation demos. Checkpoints are included in the repository for reference.
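A typical PyTorch checkpointing pattern looks like the sketch below; the saved fields and file handling are illustrative assumptions, not necessarily what this repository uses.

```python
import torch

def save_checkpoint(model, optimizer, epoch, loss, path):
    # Persist everything needed to resume training or generate text from this stage.
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }, path)

def load_checkpoint(model, optimizer, path):
    # Restore model and optimizer state from a saved training stage.
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"], checkpoint["loss"]
```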
### Perplexity Score

The model's perplexity score, calculated during training, is available in the `perplexity.csv` file. This score provides an indication of the model's predictive accuracy over time.
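Perplexity is the exponential of the average per-token cross-entropy, so it can be computed over a held-out set as in the sketch below (the data loader and the model interface from the earlier sketch are assumptions).

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, data_loader):
    """Perplexity = exp(average per-token cross-entropy) over a held-out set."""
    total_loss, total_tokens = 0.0, 0
    for token_ids in data_loader:               # token_ids: (batch, seq_len)
        inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
        logits, _ = model(inputs)               # (batch, seq_len - 1, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), reduction="sum")
        total_loss += loss.item()
        total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)
```

Lower perplexity means the model assigns higher probability to the actual next tokens.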
## Demonstration

A video demo, linked below, demonstrates:

- Text generation in English from a randomly initialized model.
- Text generation using the trained model in both English and Amharic, with English translations provided via Google Translate.

**Video Demo Link:** [YouTube Demo](https://youtu.be/1m21NYmLSC4)
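Under the hood, each demo sample comes from repeatedly sampling the next token from the model, along the lines of the hedged sketch below (the model and tokenizer interfaces are assumed to match the earlier sketches, and the temperature is illustrative).

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
    # Encode the prompt and run it through the model once to build up the hidden state.
    token_ids = torch.tensor([tokenizer.encode(prompt).ids])
    logits, hidden = model(token_ids)
    for _ in range(max_new_tokens):
        # Sample the next token from the softmax over the last position's logits.
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        token_ids = torch.cat([token_ids, next_id], dim=1)
        # Feed only the new token back in, reusing the LSTM hidden state.
        logits, hidden = model(next_id, hidden)
    return tokenizer.decode(token_ids[0].tolist())
```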
## Instructions for Reproducing the Results

1. Install the dependencies (Python, PyTorch, and the other required libraries).
2. Load the `.ipynb` notebook and run the cells sequentially to replicate training and evaluation.
3. Refer to the Hugging Face documentation for downloading the model and tokenizer files (see the sketch below).
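For step 3, individual files can be fetched with the `huggingface_hub` client; the repository id and file names below are placeholders, since this README does not list them.

```python
from huggingface_hub import hf_hub_download

# Placeholder repository id and file names; replace them with the actual ones for this model.
REPO_ID = "username/lstm-next-token-model"

model_path = hf_hub_download(repo_id=REPO_ID, filename="model_checkpoint.pt")
tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="tokenizer.json")
print(model_path, tokenizer_path)
```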