Sarath Shekkizhar

sarath-shekkizhar


sarath-shekkizhar's activity

posted an update 8 days ago
Some interesting architectural choices made in Llama 4 models -- were these key to the 10M context? Possibly šŸ¤”

šŸ” Takeaways:
šŸ§© Interleaved Attention without position encoding
- LLaMA 4 removes explicit positional encoding in some attention layers to boost performance on longer contexts.
- The principles here could be similar to the residual connections to facilitate attention to early tokens without positional decay.
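A minimal sketch of what such interleaving could look like. The interval (every 4th layer skipping positional encoding) and the helper name `layer_uses_rope` are illustrative assumptions, not Llama 4's confirmed configuration.

```python
# Hypothetical RoPE/NoPE interleaving: most layers apply rotary position
# embeddings (RoPE), while every Nth layer skips positional encoding
# entirely ("NoPE"). The interval of 4 is an assumption for illustration.
NOPE_INTERVAL = 4

def layer_uses_rope(layer_idx: int, interval: int = NOPE_INTERVAL) -> bool:
    """True if this layer applies RoPE; False for a NoPE layer."""
    return (layer_idx + 1) % interval != 0

pattern = [layer_uses_rope(i) for i in range(8)]
# In this pattern, the layers at indices 3 and 7 skip positional encoding,
# so their attention scores depend only on content, not token position.
```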

āš–ļø Scaled Softmax to increase attention at inference time
- The max attention value (output of softmax) decreases as context size increases.
- Llama 4 incorporates a context-size dependent temperature in the softmax function to modify the slope of softmax, allowing the model to focus better on relevant tokens.
- Applied only at inference time -- likely a choice made after observations on eval datasets.
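A small sketch of both effects described above: the peak softmax probability shrinking with context size, and a context-size-dependent scale that counteracts it. The formula shape and the constants (`beta`, `floor_scale`) are illustrative assumptions, not the exact values used in Llama 4.

```python
import math

def max_softmax_value(n: int) -> float:
    # One mildly "relevant" logit among n otherwise-equal logits:
    # the peak softmax probability shrinks as the context grows.
    logits = [1.0] + [0.0] * (n - 1)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return max(e / total for e in exps)

def attn_temperature_scale(position: int, beta: float = 0.1,
                           floor_scale: int = 8192) -> float:
    # Hypothetical context-size-dependent scale applied to query vectors
    # at inference time; it grows logarithmically with token position so
    # the softmax stays sharper on long contexts. Constants are assumed.
    return math.log((position + 1) // floor_scale + 1.0) * beta + 1.0
```

The logarithmic growth means the scale is 1.0 for short contexts (no change to the trained behavior) and only kicks in past the `floor_scale` threshold.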

What did you think of these choices?
reacted to their post with 🚀 11 months ago
posted an update 11 months ago
Hi folks,
Tenyx announced its latest model Llama3-TenyxChat-70B, which outperforms a GPT-4 variant on several MT-Bench measurements.

By post-training Llama-3 70B for 15 hours, our model improves reasoning capabilities, leveraging the relationship between geometry and LLM task complexity (see our paper, to be presented at ICML 2024: https://arxiv.org/abs/2312.01648).
Model: tenyx/Llama3-TenyxChat-70B · Hugging Face Space: tenyx/Llama3-TenyxChat-70B
New activity in tenyx/Llama3-TenyxChat-70B 11 months ago:
"great evals" (#2, opened 11 months ago by gblazex)