asas-ai (ASAS AI)

yazeed7

authored a paper 8 days ago

Command A: An Enterprise-Ready Large Language Model

Paper • 2504.00698 • Published 9 days ago • 23

mmhamdy

posted an update 11 days ago

Post

1547

What inspired the Transformer architecture in the "Attention Is All You Need" paper? And how were various ideas combined to create this groundbreaking model?

In this lengthy article, I explore the story and the origins of some of the ideas introduced in the paper. We'll explore everything from the fundamental attention mechanism that lies at its heart to the surprisingly simple explanation for its name, Transformer.

💡 Examples of ideas explored in the article:

✅ What was the inspiration for the attention mechanism?
✅ How did we go from attention to self-attention?
✅ Did the team have any other names in mind for the model?

and more...

I aim to tell the story of Transformers as I would have wanted to read it, and hopefully, one that appeals to others interested in the details of this fascinating idea. This narrative draws from video interviews, lectures, articles, tweets/Xs, and some digging into the literature. I have done my best to be accurate, but errors are possible. If you find inaccuracies or have any additions, please do reach out, and I will gladly make the necessary updates.

Read the article: https://huggingface.co/blog/mmhamdy/pandemonium-the-transformers-story

Benyou

authored a paper 13 days ago

Video-R1: Reinforcing Video Reasoning in MLLMs

Paper • 2503.21776 • Published 14 days ago • 76

Benyou

authored a paper about 1 month ago

S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information

Paper • 2503.05085 • Published Mar 7 • 47

khalidalt

updated a model about 1 month ago

asas-ai/OpenR1-AceGPT-v2-8B-SFT

Text Generation • Updated Feb 25 • 23

mmhamdy

posted an update about 2 months ago

Post

2753

🎉 We're excited to introduce MemoryCode, a novel synthetic dataset designed to rigorously evaluate LLMs' ability to track and execute coding instructions across multiple sessions. MemoryCode simulates realistic workplace scenarios where a mentee (the LLM) receives coding instructions from a mentor amidst a stream of both relevant and irrelevant information.

💡 But what makes MemoryCode unique?! The combination of the following:

✅ Multi-Session Dialogue Histories: MemoryCode consists of chronological sequences of dialogues between a mentor and a mentee, mirroring real-world interactions between coworkers.

✅ Interspersed Irrelevant Information: Critical instructions are deliberately interspersed with unrelated content, replicating the information overload common in office environments.

✅ Instruction Updates: Coding rules and conventions can be updated multiple times throughout the dialogue history, requiring LLMs to track and apply the most recent information.

✅ Prospective Memory: Unlike previous datasets that cue information retrieval, MemoryCode requires LLMs to spontaneously recall and apply relevant instructions without explicit prompts.

✅ Practical Task Execution: LLMs are evaluated on their ability to use the retrieved information to perform practical coding tasks, bridging the gap between information recall and real-world application.

📌 Our Findings

1️⃣ While even small models can handle isolated coding instructions, the performance of top-tier models like GPT-4o dramatically deteriorates when instructions are spread across multiple sessions.

2️⃣ This performance drop isn't simply due to the length of the context. Our analysis indicates that LLMs struggle to reason compositionally over sequences of instructions and updates. They have difficulty keeping track of which instructions are current and how to apply them.

🔗 Paper: From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions (2502.13791)
📦 Code: https://github.com/for-ai/MemoryCode

mmhamdy

authored 2 papers about 2 months ago

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published Feb 19 • 33

From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

Paper • 2502.13791 • Published Feb 19 • 5

alielfilali01

posted an update about 2 months ago

Post

920

🚨 Arabic LLM Evaluation 🚨

Few models join the ranking of https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard Today.

The new MistralAI model, Saba, is quite impressive, Top10 ! Well done @arthurmensch and team.

Sadly Mistral did not follow its strategy about public weights this time, we hope this changes soon and we get the model with a permissive license.

We added other Mistral models and apparently, we have been sleeping on mistralai/Mistral-Large-Instruct-2411 !

Another impressive model that joined the ranking today is ALLaM-AI/ALLaM-7B-Instruct-preview. After a long wait finally ALLaM is here and it is IMPRESSIVE given its size !

ALLaM is ranked on OALL/Open-Arabic-LLM-Leaderboard as well.

Benyou

authored a paper about 2 months ago

Soundwave: Less is More for Speech-Text Alignment in LLMs

Paper • 2502.12900 • Published Feb 18 • 84

khalidalt

updated a model about 2 months ago

asas-ai/AceGPT-v2-8b-chat-Open-R1-Distill

Text Generation • Updated Feb 12 • 7

khalidalt

published a model about 2 months ago

asas-ai/AceGPT-v2-8b-chat-Open-R1-Distill

Text Generation • Updated Feb 12 • 7

khalidalt

updated a model about 2 months ago

asas-ai/Llama-3.1-8B-Instruct-Open-R1-Distill

Text Generation • Updated Feb 12 • 8

khalidalt

published a model about 2 months ago

asas-ai/Llama-3.1-8B-Instruct-Open-R1-Distill

Text Generation • Updated Feb 12 • 8

khalidalt

updated a model about 2 months ago

asas-ai/Llama-3.2-1B-Open-R1-Distill

Text Generation • Updated Feb 12 • 9

mmhamdy

posted an update about 2 months ago

Post

2982

⛓ Evaluating Long Context #2: SCROLLS and ZeroSCROLLS

In this series of posts about tracing the history of long context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, Long Range Arens (LRA) is one of the earliest benchmarks designed to tackle the challenge of long context evaluation. But it wasn't introduced to evaluate LLMs, but rather the transformer architecture in general.

📜 The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (according to 2022 standards). So, what does it offer?

1️⃣ Long Text Focus: SCROLLS (unlike LRA) focus mainly on text and contain inputs with thousands of words, testing models' ability to synthesize information across lengthy documents.
2️⃣ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business.
3️⃣ Unified Format: All datasets are available in a text-to-text format, facilitating easy evaluation and comparison of models.

Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning. Other features include:

1️⃣ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries.
2️⃣ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.

💡 What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.

- SCROLLS Paper: SCROLLS: Standardized CompaRison Over Long Language Sequences (2201.03533)
- ZeroSCROLLS Paper: ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding (2305.14196)

Benyou

authored 2 papers 2 months ago

TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

Paper • 2502.01506 • Published Feb 3 • 38

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

Paper • 2501.14492 • Published Jan 24 • 34

Benyou

authored a paper 3 months ago

Enabling Scalable Oversight via Self-Evolving Critic

Paper • 2501.05727 • Published Jan 10 • 75

alielfilali01

posted an update 3 months ago

Post

2076

3C3H AraGen Leaderboard welcomes today deepseek-ai/DeepSeek-V3 and 12 other models (including the late gpt-3.5 💀) to the ranking of best LLMs in Arabic !

Observations:
- DeepSeek-v3 ranked 3rd and only Open model among the top 5 !

- A 14B open model ( Qwen/Qwen2.5-14B-Instruct) outperforms gpt-3.5-turbo-0125 (from last year). This shows how much we came in advancing and supporting Arabic presence within the LLM ecosystem !

- Contrary to what observed in likelihood-acc leaderboards (like OALL/Open-Arabic-LLM-Leaderboard) further finetuned models like maldv/Qwentile2.5-32B-Instruct actually decreased the performance compared to the original model Qwen/Qwen2.5-32B-Instruct.
It's worth to note that the decrease is statiscally insignificant which imply that at best, the out-domain finetuning do not really hurts the model original capabilities acquired during pretraining.
Previous work addressed this (finetuning VS pretraining) but more investigation in this regard is required (any PhDs here ? This could be your question ...)

Check out the latest rankings: https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard

ASAS AI

AI & ML interests

Recent Activity

asas-ai's activity

Command A: An Enterprise-Ready Large Language Model

Video-R1: Reinforcing Video Reasoning in MLLMs

S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information

asas-ai/OpenR1-AceGPT-v2-8B-SFT

MMTEB: Massive Multilingual Text Embedding Benchmark

From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

Soundwave: Less is More for Speech-Text Alignment in LLMs

asas-ai/AceGPT-v2-8b-chat-Open-R1-Distill

asas-ai/AceGPT-v2-8b-chat-Open-R1-Distill

asas-ai/Llama-3.1-8B-Instruct-Open-R1-Distill

asas-ai/Llama-3.1-8B-Instruct-Open-R1-Distill

asas-ai/Llama-3.2-1B-Open-R1-Distill

TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

Enabling Scalable Oversight via Self-Evolving Critic

AI & ML interests

Recent Activity

Team members 51

asas-ai's activity