Test network using differential attention instead of classical attention. Aside from the changes to the attention mechanism, this uses the same configuration as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct

# Training Metrics

## Dataset Information

- Training data per epoch: 1 GB
- Total tokens trained: 48,261,120
- No synthetic data

## Training Results

- Final Train Loss: 2.6883
- Final Train Perplexity: 14.71

![image/png](https://cdn-uploads.huggingface.co/production/uploads/637f3b03932a61b89aefbf5c/Eu8hsPYrKQqFvt-54_AkY.png)
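For reference, the differential attention mechanism mentioned above computes the difference of two softmax attention maps, with the second map acting as a noise canceller scaled by a factor λ. The sketch below is a minimal NumPy illustration of that idea, not this model's actual implementation; the function name, shapes, and the fixed `lam` value are illustrative (in practice λ is a learned, layer-dependent parameter).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Difference of two scaled-dot-product attention maps, applied to v.

    q1/k1 and q2/k2 are two separate query/key projections of the input;
    lam stands in for the learned lambda of differential attention.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.swapaxes(-1, -2) / np.sqrt(d))
    a2 = softmax(q2 @ k2.swapaxes(-1, -2) / np.sqrt(d))
    # With lam = 0 this reduces to classical softmax attention on (q1, k1)
    return (a1 - lam * a2) @ v
```

With `lam=0.0` the second map is ignored and the output matches classical attention, which makes the relationship to the baseline configuration easy to check.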