Huang Liang Hsun

lianghsun

AI & ML interests

Founder of Twinkle AI. Focused on applying deep learning in legal and scientific domains, with expertise in NLP and model fine-tuning.

Recent Activity

updated a dataset 4 days ago
twinkle-ai/llama-4-eval-logs-and-scores
published a dataset 4 days ago
twinkle-ai/llama-4-eval-logs-and-scores
updated a dataset 4 days ago
lianghsun/tw-textbook-dpo

Organizations

shareAI, Hugging Face for Legal, Model Collapse, Taiwan Llama, Twinkle AI

lianghsun's activity

replied to their post 8 days ago

lol thanks! I’ve always wondered why HF posts don’t support markdown.

posted an update 9 days ago

With the arrival of Twinkle April — Twinkle AI’s annual open-source celebration held every April — our community is excited to unveil its very first project:

📊 Twinkle Eval (https://github.com/ai-twinkle/Eval), a next-generation evaluation tool led by our contributor @tedslin.

Unlike traditional evaluation tools such as iKala’s ievals (https://github.com/ikala-ai/ievals), which can only evaluate language models (LMs) one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time grows with more complex models, traditional tools become increasingly inefficient 😲 — for example, evaluating LRMs on the ikala/tmmluplus benchmark could run for half a day without finishing.
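
To make the bottleneck concrete, here is a minimal sketch (not Twinkle Eval's actual code) of dispatching benchmark questions concurrently against an OpenAI-compatible endpoint instead of awaiting them one at a time. The base URL, model name, and concurrency limit are placeholder assumptions.

```python
import asyncio

from openai import AsyncOpenAI  # any OpenAI-compatible endpoint works the same way

# Hypothetical local inference server; swap in your own endpoint and key.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(question: str, sem: asyncio.Semaphore) -> str:
    """Send one benchmark question; the semaphore caps in-flight requests."""
    async with sem:
        resp = await client.chat.completions.create(
            model="my-lrm",  # placeholder model name
            messages=[{"role": "user", "content": question}],
        )
        return resp.choices[0].message.content

async def run_batch(questions: list[str], concurrency: int = 16) -> list[str]:
    """Evaluate many samples in parallel instead of one at a time."""
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(ask(q, sem) for q in questions))

# answers = asyncio.run(run_batch(all_questions))
```

With sequential calls, wall-clock time is the sum of every per-question latency; with concurrent dispatch it is bounded by the slowest requests in each batch, which is what makes long-reasoning models feasible to benchmark.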

One question we were especially curious about:
Does shuffling multiple-choice answer order impact model accuracy? 🤔
→ See: "Changing Answer Order Can Decrease MMLU Accuracy" – arXiv:2406.19470v1
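
As an illustration of what shuffling answer order involves (a minimal sketch, not code from the paper or from Twinkle Eval), each question's options can be permuted and the gold letter remapped before scoring:

```python
import random

def shuffle_choices(question: str, choices: dict[str, str], gold: str, seed: int) -> tuple[str, str]:
    """Shuffle the options of one multiple-choice question.

    `choices` maps the original letters ("A".."D") to option text and `gold` is the
    original correct letter; returns the reformatted prompt and the new gold letter.
    """
    rng = random.Random(seed)
    pairs = list(choices.items())
    rng.shuffle(pairs)                              # permute the (letter, text) pairs
    letters = "ABCD"[: len(pairs)]                  # assumes at most four options
    new_gold = next(letters[i] for i, (orig, _) in enumerate(pairs) if orig == gold)
    options = "\n".join(f"({new}) {text}" for new, (_, text) in zip(letters, pairs))
    return f"{question}\n{options}", new_gold

# Example: the correct option keeps its text but may get a different letter.
# prompt, gold = shuffle_choices("2 + 2 = ?", {"A": "3", "B": "4", "C": "5", "D": "22"}, gold="B", seed=0)
```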

To address these challenges, Twinkle Eval brings three key innovations to the table (a combined sketch follows the list):

1️⃣ Parallelized evaluation of samples
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness
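
Putting the three ideas together conceptually, the rough sketch below reuses the shuffle_choices helper from above and runs the whole benchmark for several rounds, re-shuffling every question with a fresh seed each round and reporting accuracy as a mean plus spread. None of these names come from Twinkle Eval's API, and the parallel dispatch from the earlier sketch is left out for brevity.

```python
import statistics

def evaluate_rounds(dataset, ask_model, n_rounds: int = 3) -> tuple[float, float]:
    """Run the benchmark `n_rounds` times and report mean accuracy and its spread.

    `dataset` is a list of (question, choices, gold) tuples and `ask_model` maps a
    prompt to the model's chosen letter; both are assumed helpers for this sketch.
    """
    accuracies = []
    for round_idx in range(n_rounds):
        correct = 0
        for i, (question, choices, gold) in enumerate(dataset):
            # A different seed per (round, question) changes the option order every round.
            prompt, new_gold = shuffle_choices(question, choices, gold, seed=round_idx * 100_003 + i)
            if ask_model(prompt) == new_gold:
                correct += 1
        accuracies.append(correct / len(dataset))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)
```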

After running experiments, we observed that Twinkle Eval can speed up evaluation by up to 15× 🚀🚀. Interestingly, most models scored slightly lower under test settings 2️⃣ and 3️⃣ than their claimed performance — suggesting further benchmarking is needed.

This framework also comes with additional tunable parameters and detailed logging of LM behavior per question — perfect for those who want to dive deeper. 😆
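
As a purely hypothetical example of what such a per-question record could carry (the actual fields and format are defined by the Twinkle Eval project), one JSON line per question might look like this:

```python
import json

record = {
    "question_id": "tmmluplus/accounting/0042",  # illustrative ID, not a real entry
    "round": 2,
    "shuffled_order": ["C", "A", "D", "B"],      # how the options were permuted
    "gold": "B",
    "model_answer": "B",
    "correct": True,
    "latency_s": 14.7,
    "raw_output": "...",                         # full model response, truncated here
}
print(json.dumps(record, ensure_ascii=False))
```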

If you find Twinkle Eval useful, please ⭐ the project and help spread the word 🤗
updated a Space 11 days ago