With the arrival of Twinkle April (Twinkle AI’s annual open-source celebration held every April), our community is excited to unveil its very first project: Twinkle Eval!
Unlike traditional evaluation tools such as iKala’s ievals (https://github.com/ikala-ai/ievals), which can only evaluate language models (LMs) one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time grows with more complex models, traditional tools become increasingly inefficient 😲: for example, evaluating LRMs on the ikala/tmmluplus benchmark could take more than half a day and still not finish.
One question we were especially curious about: does shuffling the order of multiple-choice answers affect model accuracy? 🤔 → See: "Changing Answer Order Can Decrease MMLU Accuracy" – arXiv:2406.19470v1
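To make the idea concrete, here is a minimal sketch (plain Python, not Twinkle Eval’s actual code; `shuffle_choices` is a made-up helper) of shuffling a multiple-choice item while keeping track of where the gold answer ends up:

```python
import random

def shuffle_choices(question, choices, answer_idx, seed=None):
    """Shuffle the answer choices and return the new index of the gold answer."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)                        # e.g. [2, 0, 3, 1]
    shuffled = [choices[i] for i in order]    # choices re-ordered
    new_answer_idx = order.index(answer_idx)  # gold answer follows its choice
    return question, shuffled, new_answer_idx

# Only the presentation order changes; the correct answer text stays the same.
q, opts, gold = shuffle_choices("2 + 2 = ?", ["3", "4", "5", "22"], answer_idx=1, seed=7)
print(opts, "-> gold:", opts[gold])  # gold is still "4"
```

If a model’s score drops when nothing but this ordering changes, that drop is a robustness signal rather than a knowledge gap.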
To address these challenges, Twinkle Eval brings three key innovations to the table:
1️⃣ Parallelized evaluation of samples
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness
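In spirit, 1️⃣ and 2️⃣ look roughly like the sketch below (illustrative asyncio code, not the framework’s real API; `query_model` is a placeholder for whatever client you use), while 3️⃣ plugs in the kind of shuffling shown earlier, applied fresh each round:

```python
import asyncio

async def query_model(sample):
    """Placeholder: send one sample to the model under test and return its answer."""
    ...

async def evaluate_round(samples, max_concurrency=16):
    # Cap concurrency so the inference endpoint is not overwhelmed.
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(sample):
        async with sem:
            return await query_model(sample)

    # Keep many samples in flight at once instead of one at a time (1️⃣).
    return await asyncio.gather(*(run_one(s) for s in samples))

async def evaluate(samples, rounds=3):
    # Repeat the whole benchmark several times to gauge run-to-run stability (2️⃣).
    return [await evaluate_round(samples) for _ in range(rounds)]
```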
After running experiments, we observed that Twinkle Eval can speed up evaluation by up to 15× 🚀🚀. Interestingly, most models scored slightly lower under the multi-round (2️⃣) and shuffled-answer (3️⃣) settings than their claimed performance, which suggests further benchmarking is needed.
This framework also comes with additional tunable parameters and detailed logging of LM behavior per question — perfect for those who want to dive deeper. 😆
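As an illustration only (this is not Twinkle Eval’s actual log schema; every field name below is made up), a per-question record might capture something like:

```python
import json

# Hypothetical per-question log entry; field names are illustrative.
record = {
    "question_id": "tmmluplus/geography/0001",
    "round": 2,
    "choice_order": ["C", "A", "D", "B"],  # order the choices were shown in
    "model_answer": "A",
    "correct": True,
    "latency_seconds": 12.7,
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```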
If you find Twinkle Eval useful, please ⭐ the project and help spread the word 🤗