Huang Liang Hsun
lianghsun
AI & ML interests
Founder of Twinkle AI. Focused on applying deep learning in legal and scientific domains, with expertise in NLP and model fine-tuning.
Recent Activity
updated a model about 1 hour ago: twinkle-ai/Llama-3.2-3B-F1-Instruct
replied to their post about 18 hours ago:
With the arrival of Twinkle April — Twinkle AI’s annual open-source celebration held every April — our community is excited to unveil its very first project:
📊 Twinkle Eval (https://github.com/ai-twinkle/Eval), a next-generation evaluation tool led by our contributor @tedslin.
Unlike traditional evaluation tools like iKala’s ievals (https://github.com/ikala-ai/ievals), which can only evaluate language models (LMs) one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time increases with more complex models, traditional tools become increasingly inefficient 😲 — for example, evaluating LRMs on the https://huggingface.co/datasets/ikala/tmmluplus benchmark could take half a day without finishing.
One question we were especially curious about:
Does shuffling multiple-choice answer order impact model accuracy? 🤔
→ See: "Changing Answer Order Can Decrease MMLU Accuracy" – arXiv:2406.19470v1
To address these challenges, Twinkle Eval brings three key innovations to the table:
1️⃣ Parallelized evaluation of samples
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness
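A minimal sketch of what randomized answer order (3️⃣) could look like, assuming a simple four-option multiple-choice format. The function name `shuffle_choices` and the prompt layout are illustrative only, not Twinkle Eval's actual API:

```python
import random

def shuffle_choices(question: str, choices: list[str], answer_idx: int,
                    rng: random.Random) -> tuple[str, int]:
    """Shuffle a multiple-choice item's options and remap the gold label,
    so a model cannot benefit from positional bias (e.g. always picking A)."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # Find where the originally-correct choice landed after shuffling.
    new_answer_idx = order.index(answer_idx)
    labels = "ABCD"  # assumes at most four options
    prompt = question + "\n" + "\n".join(
        f"({labels[i]}) {c}" for i, c in enumerate(shuffled)
    )
    return prompt, new_answer_idx

prompt, gold = shuffle_choices(
    "What is 2 + 2?", ["3", "4", "5", "22"], answer_idx=1,
    rng=random.Random(0),
)
```

Scoring the same items across several such shuffles (combined with multi-round testing) exposes positional bias that a fixed A/B/C/D order can hide.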
After running experiments, we observed that Twinkle Eval can speed up evaluation by up to 15× 🚀🚀. Interestingly, most models scored slightly lower under multi-round testing (2️⃣) and randomized answer order (3️⃣) than their claimed performance — suggesting further benchmarking is needed.
This framework also comes with additional tunable parameters and detailed logging of LM behavior per question — perfect for those who want to dive deeper. 😆
If you find Twinkle Eval useful, please ⭐ the project and help spread the word 🤗
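The parallelized-evaluation idea (1️⃣) can be sketched as follows. Because API-served LRM inference is dominated by per-request latency, issuing many samples concurrently is where most of the wall-clock speedup comes from. Here `query_model` is a placeholder stand-in, not Twinkle Eval's real client:

```python
from concurrent.futures import ThreadPoolExecutor

def query_model(sample: str) -> str:
    """Placeholder for a real model/API call; in practice this would
    send `sample` to an inference endpoint and return the completion."""
    return sample.upper()

def evaluate_parallel(samples: list[str], workers: int = 8) -> list[str]:
    """Evaluate many samples concurrently instead of one at a time.
    Threads suit I/O-bound API calls; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(query_model, samples))

results = evaluate_parallel(["a", "b", "c"])
```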
updated a collection 1 day ago: 🏎️ Formosa-1 Series
Organizations
lianghsun's activity
Upload tokenizer_config.json
#1 opened 15 days ago by minyichen

Upload train-00000-of-00001.parquet
#2 opened 16 days ago by lianghsun

Upload tw_instruct_R1_liang.json
#4 opened 19 days ago by lianghsun

Upload 3 files
#5 opened 21 days ago by minyichen

Upload datasets.jsonl
#2 opened 21 days ago by minyichen

Will there be a version updated for 2025?
#2 opened 23 days ago by lianghsun

Upload identity.json
#4 opened 24 days ago by minyichen

Upload 2 files
#2 opened 24 days ago by minyichen

Question About Benchmark Version in README
#9 opened 24 days ago by lianghsun

Upload validation-00000-of-00001.parquet
#2 opened 26 days ago by lianghsun

The playground is broken
#2 opened 3 months ago by metalnow
🚩 Report: Legal issue(s)
#1 opened 3 months ago by wayne1998
Dataset Viewer issue: UnexpectedError
#2 opened 4 months ago by lianghsun

free-gpt-4o-chat
#1 opened 4 months ago by avadhuta
Usage issues after converting to GGUF
#2 opened 4 months ago by AtwoodYen
Want to try Llama-3.2-Taiwan-1B? 😎
#1 opened 4 months ago by lianghsun

Want to try Llama-3.2-Taiwan-1B? 😎
#2 opened 4 months ago by lianghsun

Convert to GGUF format failed!
#1 opened 4 months ago by leotaipei
