
Crispin Almodovar

calmodovar

AI & ML interests

NLP, log anomaly detection, cyber intelligence

Organizations

logfit-project, test012345

calmodovar's activity

reacted to merve's post with 🔥 12 days ago
supercharge your LLM apps with smolagents 🔥

however cool your LLM is, without being agentic it can only go so far

enter smolagents: a new agent library by Hugging Face to make the LLM write code, do analysis and automate boring stuff!

Here's our blog for you to get started https://huggingface.co/blog/smolagents
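
For context, here is a minimal sketch of how an agent is set up, following the getting-started pattern in the smolagents documentation. Class names such as CodeAgent, HfApiModel, and DuckDuckGoSearchTool reflect the initial release and may differ in later versions.

```python
# Minimal sketch based on the smolagents getting-started example.
# HfApiModel and DuckDuckGoSearchTool are the names used at release;
# later versions may rename them.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# CodeAgent plans in Python: the LLM writes small code snippets that are
# executed step by step until the task is solved.
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("How many seconds would it take a leopard at full speed to run through Pont des Arts?")
```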
reacted to MoritzLaurer's post with 👍 23 days ago
Quite excited by the ModernBERT release! Small at 0.15B/0.4B parameters, trained on 2T tokens of modern pre-training data (code included) with a modern tokenizer, and an 8k context window: a great, efficient model for embeddings & classification!

This will probably be the basis for many future SOTA encoders! And I can finally stop using DeBERTaV3 from 2021 :D

Congrats @answerdotai, @LightOnIO and collaborators like @tomaarsen!

Paper and models here 👇 https://huggingface.co/collections/answerdotai/modernbert-67627ad707a4acbf33c41deb
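
As a quick illustration (not part of the original post), ModernBERT can be used as a drop-in masked language model via the transformers pipeline API, assuming a transformers release recent enough to include ModernBERT support:

```python
# Minimal sketch: ModernBERT as a fill-mask model via transformers.
# Assumes a transformers version that ships ModernBERT support.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill_mask("The capital of France is [MASK]."))
```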
upvoted an article 2 months ago

Visually Multilingual: Introducing mcdse-2b

By marco
reacted to singhsidhukuldeep's post with 👀 3 months ago
While Google's Transformer might have introduced "Attention is all you need," Microsoft and Tsinghua University are here with the DIFF Transformer, stating, "Sparse-Attention is all you need."

The DIFF Transformer outperforms traditional Transformers in scaling properties, requiring only about 65% of the model size or training tokens to achieve comparable performance.

The secret sauce? A differential attention mechanism that amplifies focus on relevant context while canceling out noise, leading to sparser and more effective attention patterns.

How?
- It uses two separate softmax attention maps and subtracts one from the other (see the sketch after this post).
- It employs a learnable scalar λ for balancing the attention maps.
- It implements GroupNorm for each attention head independently.
- It is compatible with FlashAttention for efficient computation.

What do you get?
- Superior long-context modeling (up to 64K tokens).
- Enhanced key information retrieval.
- Reduced hallucination in question-answering and summarization tasks.
- More robust in-context learning, less affected by prompt order.
- Mitigation of activation outliers, opening doors for efficient quantization.

Extensive experiments show DIFF Transformer's advantages across various tasks and model sizes, from 830M to 13.1B parameters.

This innovative architecture could be a game-changer for the next generation of LLMs. What are your thoughts on DIFF Transformer's potential impact?
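
To make the mechanism in the "How?" list concrete, here is a minimal single-head PyTorch sketch of differential attention. It is an illustration only, not the authors' implementation: the real model reparameterizes λ, applies a headwise RMSNorm (which the paper refers to as GroupNorm), and rescales the output, and this sketch omits causal masking and multi-head concatenation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffAttentionHead(nn.Module):
    """Illustrative single head of differential attention (not the official code).

    Two softmax attention maps are computed from separate query/key projections;
    their difference, weighted by a learnable scalar lambda, attends over the
    values, cancelling attention noise shared by both maps. Causal masking and
    multi-head concatenation are omitted for brevity.
    """

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two sets of query/key projections produce the two attention maps.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        self.lambda_ = nn.Parameter(torch.tensor(lambda_init))  # learnable scalar
        # Stand-in for the per-head normalization (the paper uses a headwise RMSNorm).
        self.norm = nn.LayerNorm(d_head)
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        # Differential attention: subtract the second map to cancel common-mode noise.
        return self.norm((a1 - self.lambda_ * a2) @ v)
```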
upvoted an article 6 months ago

Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth

By mlabonne
upvoted an article 8 months ago

Multimodal Augmentation for Documents: Recovering “Comprehension” in “Reading and Comprehension” task

By danaaubakirova