While Google's Transformer might have introduced "Attention is all you need," Microsoft and Tsinghua University are here with the DIFF Transformer, stating, "Sparse-Attention is all you need."
The DIFF Transformer outperforms traditional Transformers in scaling properties, requiring only about 65% of the model size or training tokens to achieve comparable performance.
The secret sauce? A differential attention mechanism that amplifies focus on relevant context while canceling out noise, leading to sparser and more effective attention patterns.
How?
- It uses two separate softmax attention maps and subtracts them.
- It employs a learnable scalar λ for balancing the attention maps.
- It applies GroupNorm to each attention head independently.
- It is compatible with FlashAttention for efficient computation.
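To make the subtraction concrete, here is a minimal single-head NumPy sketch of the differential-attention idea. This is illustrative only, not the paper's implementation: it omits the per-head GroupNorm, the multi-head machinery, and the paper's reparameterization of λ (fixed here as a plain scalar), and all weight names are made up for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8):
    """Toy single-head differential attention: two softmax attention
    maps are computed from separate query/key projections, and their
    difference is used to weight the values, cancelling common-mode
    attention noise."""
    d = Wq1.shape[1]
    scores1 = (X @ Wq1) @ (X @ Wk1).T / np.sqrt(d)
    scores2 = (X @ Wq2) @ (X @ Wk2).T / np.sqrt(d)
    A = softmax(scores1) - lam * softmax(scores2)  # differential map
    return A @ (X @ Wv)

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
out = diff_attention(X, *W)
```

Note that each row of the differential map sums to 1 - λ rather than 1, which is one reason the real architecture normalizes each head's output afterwards.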
What do you get?
- Superior long-context modeling (up to 64K tokens).
- Enhanced key information retrieval.
- Reduced hallucination in question-answering and summarization tasks.
- More robust in-context learning, less affected by prompt order.
- Mitigation of activation outliers, opening doors for efficient quantization.
Extensive experiments show DIFF Transformer's advantages across various tasks and model sizes, from 830M to 13.1B parameters.
This innovative architecture could be a game-changer for the next generation of LLMs. What are your thoughts on DIFF Transformer's potential impact?
Researchers from Mila and Borealis AI have just shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's Transformers.
They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
❶ Removed dependencies on previous hidden states in the gates
❷ Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
❸ Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)
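Change ❶ is the crucial one, and it is easy to see in code. Below is a minimal NumPy sketch of the minGRU recurrence as I understand it from the paper: the update gate and the candidate state are computed from x_t alone (never from h_{t-1}), and the candidate has no tanh. The weight names and the sequential loop are mine, for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def min_gru(X, Wz, Wh, h0):
    """Sequential reference of the minGRU recurrence:
        z_t      = sigmoid(Wz @ x_t)      # gate from the input alone
        h~_t     = Wh @ x_t               # candidate state, no tanh
        h_t      = (1 - z_t) * h_{t-1} + z_t * h~_t
    Because nothing on the right-hand side except h_{t-1} depends on
    the previous state, the recurrence is linear in h_{t-1}."""
    h, states = h0, []
    for x in X:
        z = sigmoid(Wz @ x)
        h_tilde = Wh @ x
        h = (1 - z) * h + z * h_tilde
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d_in, d_h = 6, 4, 3
X = rng.normal(size=(T, d_in))
Wz = rng.normal(size=(d_h, d_in))
Wh = rng.normal(size=(d_h, d_in))
out = min_gru(X, Wz, Wh, np.zeros(d_h))
```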
➡️ As a result, you can use a "parallel scan" algorithm to train these new, minimal RNNs, in parallel, taking 88% more memory but also making them 200x faster than their traditional counterparts for long sequences
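Why does the parallel scan suddenly apply? With input-only gates, each step is an affine map h_t = a_t * h_{t-1} + b_t (with a_t = 1 - z_t and b_t = z_t * h̃_t), and affine maps compose associatively, which is exactly what a scan exploits. A toy sketch, my own illustration rather than the paper's algorithm (real implementations use a numerically safer log-space scan on GPU):

```python
import numpy as np

def combine(p, q):
    """Compose two affine maps h -> a*h + b (p applied first, then q).
    The composition is (a1*a2, a2*b1 + b2), and it is associative."""
    a1, b1 = p
    a2, b2 = q
    return (a1 * a2, a2 * b1 + b2)

def prefix_scan(pairs):
    """Inclusive prefix scan over `combine`, written divide-and-conquer
    so the two halves could be computed in parallel."""
    if len(pairs) == 1:
        return list(pairs)
    mid = len(pairs) // 2
    left = prefix_scan(pairs[:mid])
    right = prefix_scan(pairs[mid:])
    carry = left[-1]  # composition of every map in the left half
    return left + [combine(carry, r) for r in right]

# The minimal-RNN recurrence has the form h_t = a_t * h_{t-1} + b_t
rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, size=8)   # plays the role of (1 - z_t)
b = rng.normal(size=8)              # plays the role of z_t * h~_t
h0 = 1.0

# Sequential reference
h, seq = h0, []
for t in range(8):
    h = a[t] * h + b[t]
    seq.append(h)

# Scan version: compose the affine maps, then apply each prefix to h0
scanned = [A * h0 + B for A, B in prefix_scan(list(zip(a, b)))]
```

The scan produces the same hidden states as the sequential loop, but its combine steps can be batched across time on parallel hardware.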
🔥 The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.
And for language modeling, they need 2.5x fewer training steps than Transformers to reach the same performance!
🤔 Why does this matter?
By showing there are simpler models with similar performance to transformers, this challenges the narrative that we need advanced architectures for better performance!
💬 François Chollet wrote in a tweet about this paper:
"The fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)"
"Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape."
It's the Bitter Lesson by Rich Sutton striking again: you don't need fancy thinking architectures, just scale up your model and data!
SetFit v1.1.0 is out! Training efficient classifiers on CPU or GPU now uses the Sentence Transformers Trainer, and we resolved a lot of issues caused by updates of third-party libraries (like Transformers). Details:
Training a SetFit classifier model consists of 2 phases:
1. Finetuning a Sentence Transformer embedding model
2. Training a classifier to map embeddings -> classes
The first phase now uses the SentenceTransformerTrainer that was introduced in the Sentence Transformers v3 update. This brings some immediate upsides, like multi-GPU support, without any (intended) breaking changes.
⚡️ Beyond that, we softly deprecated the "evaluation_strategy" argument in favor of "eval_strategy" (following a Transformers deprecation), and dropped support for Python 3.7. In return, we added official support for Python 3.11 and 3.12.
✨ There are some more minor changes too, like max_steps and eval_max_steps now being a hard limit instead of an approximate one, training/validation losses now logging nicely in notebooks, and the "device" parameter no longer being ignored in some situations.
Fast inference is no longer a nice-to-have demo; it will be the driving force behind future frontier models. Time to switch over to custom AI hardware and short Nvidia.
How can I make my RAG application generate real-time responses? Up until now, I have been using Groq for fast LLM generation and the Gradio Live function. I am looking for a better solution that can help me build a real-time application without any delay. @abidlabs