Test network using differential attention instead of classical attention. Aside from the changes to the attention mechanism, this uses the same configuration as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct

# Training Metrics

## Dataset Information

- Training data per epoch: 1 GB
- Total tokens trained: 48,261,120
- No synthetic data

## Training Results

- Final Train Loss: 2.6883
- Final Train Perplexity: 14.71

![image/png](https://cdn-uploads.huggingface.co/production/uploads/637f3b03932a61b89aefbf5c/Eu8hsPYrKQqFvt-54_AkY.png)
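For reference, the differential attention mechanism mentioned above computes the difference of two softmax attention maps, with the second map acting as a noise canceller scaled by a factor λ. The sketch below is a minimal NumPy illustration of that idea, not this model's actual implementation; the function name, shapes, and the fixed `lam` value are illustrative (in practice λ is a learned, layer-dependent parameter).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Difference of two scaled-dot-product attention maps, applied to v.

    q1/k1 and q2/k2 are two separate query/key projections of the input;
    lam stands in for the learned lambda of differential attention.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.swapaxes(-1, -2) / np.sqrt(d))
    a2 = softmax(q2 @ k2.swapaxes(-1, -2) / np.sqrt(d))
    # With lam = 0 this reduces to classical softmax attention on (q1, k1)
    return (a1 - lam * a2) @ v
```

With `lam=0.0` the second map is ignored and the output matches classical attention, which makes the relationship to the baseline configuration easy to check.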