Test network using differential attention instead of classical attention (and NoPE, i.e. no positional embeddings). Other than the alterations to the attention mechanism, this is otherwise the same configuration as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
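
For orientation, below is a minimal single-head sketch of the differential attention pattern. It is illustrative only: the class name, the plain learnable `lambda` scalar, and the single-head simplification are assumptions, not this repository's implementation. No positional embeddings are applied, matching the NoPE setup.

```python
import torch
import torch.nn.functional as F
from torch import nn


class DifferentialAttentionSketch(nn.Module):
    """Illustrative single-head differential attention (NoPE: no positional encoding).

    Two attention maps are computed from split query/key projections and
    subtracted, scaled by a learnable lambda, before being applied to V.
    """

    def __init__(self, dim: int, head_dim: int, lambda_init: float = 0.8):
        super().__init__()
        # Project to two sets of queries/keys so two attention maps can be formed.
        self.q_proj = nn.Linear(dim, 2 * head_dim, bias=False)
        self.k_proj = nn.Linear(dim, 2 * head_dim, bias=False)
        self.v_proj = nn.Linear(dim, head_dim, bias=False)
        self.out_proj = nn.Linear(head_dim, dim, bias=False)
        self.lam = nn.Parameter(torch.tensor(lambda_init))  # simplified vs. the paper's reparameterization
        self.head_dim = head_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); no positional embeddings are added (NoPE).
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        scale = self.head_dim ** -0.5

        # Causal mask so each position only attends to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1)

        a1 = (q1 @ k1.transpose(-2, -1)) * scale
        a2 = (q2 @ k2.transpose(-2, -1)) * scale
        a1 = a1.masked_fill(mask, float("-inf"))
        a2 = a2.masked_fill(mask, float("-inf"))

        # Differential attention: subtract the second map to cancel common-mode noise.
        attn = F.softmax(a1, dim=-1) - self.lam * F.softmax(a2, dim=-1)
        return self.out_proj(attn @ v)
```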

# Scripts:
- `inference.py` to run the model with some test prompts
- `test_train.py` runs with the exact configuration used to train this model and serves as the reproduction script. Data is assumed to be in JSONL format, one object per line with a `"text"` field, e.g. `{"text": "example text"}` (see the loading sketch after this list)
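
As an illustration of that data layout, the sketch below reads such a file; the function name `load_texts` and the path `data.jsonl` are placeholders and are not part of `test_train.py`.

```python
import json


def load_texts(path: str) -> list[str]:
    """Read a JSONL file where each line is an object like {"text": "..."}."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                texts.append(json.loads(line)["text"])
    return texts


# Example usage (placeholder path):
# texts = load_texts("data.jsonl")
```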

# Notes:
Compared to the SmolLM2 control model, this model is bordering on incoherent. The model size may simply be too small to leverage differential attention effectively. It has clearly picked up on some patterns of the language, but its output is generally worse than that of the control model using GQA, as judged by a human reader.


# Training Metrics

## Dataset Information
- Training data per epoch: 1 GB
- Total tokens trained: 48,261,120
- No synthetic data

## Training Results
- Final Train Loss: 2.6883
- Final Train Perplexity: 14.71
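
As a sanity check, the reported perplexity is consistent with the final train loss (perplexity = exp(cross-entropy loss)):

```python
import math

final_train_loss = 2.6883
print(round(math.exp(final_train_loss), 2))  # 14.71
```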

![image/png](https://cdn-uploads.huggingface.co/production/uploads/637f3b03932a61b89aefbf5c/Eu8hsPYrKQqFvt-54_AkY.png)