Yi Cui

onekq

AI & ML interests

Benchmark, Code Generation Model

Recent Activity

Organizations

MLX Community, ONEKQ AI

onekq's activity

posted an update about 13 hours ago
updated a Space about 15 hours ago
replied to their post 1 day ago
posted an update 1 day ago
reacted to JLouisBiz's post with 👀 1 day ago
https://www.youtube.com/watch?v=84iS3atFQdI

**Speech typing in Emacs** using the NVIDIA Canary 1B model in multiple languages

This video showcases a demonstration of speech-to-text capabilities within the popular text editor, Emacs, utilizing the advanced NVIDIA Canary 1 Billion parameter (1B) language model. The presentation highlights how users can effectively type and edit documents across various programming or markup languages using spoken commands.

The demo likely illustrates seamless integration between cutting-edge AI technology from NVIDIA's Canary series (known for its powerful natural language processing capabilities) and Emacs, a highly customizable text editor favored by developers worldwide. By leveraging the 1B model, which is capable of understanding context and nuances in multiple human languages, users can dictate their code or prose directly into Emacs with impressive accuracy.

The video probably covers how this setup supports several different programming languages as well as natural language typing tasks, showcasing its versatility across various domains such as software development and content creation. Additionally, the demonstration may include examples of real-time transcription performance in diverse linguistic contexts to emphasize the model's multilingual proficiency.

Overall, viewers can expect insights into enhancing productivity by integrating AI-driven speech recognition directly within their text editing workflow using Emacs paired with NVIDIA's advanced language models.
reacted to ZennyKenny's post with 🤗 1 day ago
Submitted my first dataset for the Reasoning Datasets Competition! ZennyKenny/TRON-dataset-v.1.0

This dataset is designed to post-train Metareasoning agents, i.e. agents whose job is to quickly (and, importantly, cheaply) decide whether it makes sense to launch a full reasoning job or simply use a completions job.
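
A minimal sketch of that routing idea, assuming a hypothetical keyword-and-length heuristic (an illustration only, not the dataset's actual labeling logic):

```python
# Hypothetical metareasoning router: cheaply decide whether a query warrants a
# full reasoning run or a plain completion. The heuristic below is a stand-in.
REASONING_HINTS = ("prove", "step by step", "optimize", "debug", "how many")

def route(query: str) -> str:
    """Return which backend should handle this query."""
    needs_reasoning = len(query.split()) > 40 or any(h in query.lower() for h in REASONING_HINTS)
    return "reasoning-model" if needs_reasoning else "completions-model"

print(route("What is the capital of France?"))                   # -> completions-model
print(route("Prove that the sum of two even numbers is even."))  # -> reasoning-model
```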

There's still plenty of time to join the competition! https://www.bespokelabs.ai/blog/reasoning-datasets-competition

Generation notebook (linked in dataset) is open source and pretty well generalized, if I do say so myself, so you can use it to make your own Metareasoning datasets.

Shoutout to @onekq for his inspiring comment on this topic.
posted an update 3 days ago
posted an update 4 days ago
replied to their post 4 days ago

Agreed on NPUs, looking forward to seeing more of them hitting the fab.

posted an update 5 days ago
I used three posts to explain GPU/CPU and LLM performance; now I finally circle back to my own model. 😅

OneSQL needs a GPU because it processes long prompts. It is not a chatbot that replies to short prompts with long answers. I call models of my kind workhorse models.

We all have to scramble for GPUs to get adoption. Below are a few ways.

You can inherit it. If you have a new Mac, congratulations, you have a GPU.

You can leverage it. Get inference providers to adopt your model, then you switch from CapEx to OpEx.

Or you buy it. Go frugal: find older GPUs with enough HBM to house your model.
posted an update 6 days ago
I just compared tasks with different input/output lengths. CPU and GPU performance are very different here.

The LLMs we use today are autoregressive or causal inference models, meaning the generation of each output token depends on all previous tokens. Since the model must generate one token at a time, it sets a hard limit on parallelism. The chatbot simulating human typing is in fact a UI trick to gloss over this fundamental limit. This is great news for CPUs because it levels the playing field.

But when processing input tokens, this limit doesn't exist. The GPU can fire up thousands of cores (vs dozens of CPU cores) to process as many input tokens as it can, all at once. Here, GPU enjoys a significant speed margin over CPU. The longer the prompt, the bigger the margin.

So, when it comes to user experience, both GPU and CPU can output text at decent speed. What really distinguishes them is the initial wait time, i.e. prompt processing delay.
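
A back-of-the-envelope sketch of that wait-time difference; the throughput numbers are made-up assumptions for illustration, not measurements:

```python
# Toy latency model: prefill is parallel (all prompt tokens at once),
# decode is sequential (one output token at a time).
def latency(prompt_tokens: int, output_tokens: int,
            prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (time to first token, total time) in seconds."""
    prefill = prompt_tokens / prefill_tps   # parallel compute -> GPUs shine here
    decode = output_tokens / decode_tps     # serial bottleneck -> CPUs stay in the game
    return prefill, prefill + decode

# Hypothetical throughputs: the GPU prefills ~50x faster, but decode speeds are much closer.
for name, prefill_tps, decode_tps in [("CPU", 200, 15), ("GPU", 10_000, 60)]:
    ttft, total = latency(prompt_tokens=4_000, output_tokens=200,
                          prefill_tps=prefill_tps, decode_tps=decode_tps)
    print(f"{name}: first token after {ttft:.1f}s, finished after {total:.1f}s")
```
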
replied to ZennyKenny's post 7 days ago

Benchmarks nowadays focus on accuracy. It would be great if we could also factor in token cost, i.e. delivering the right answer with the fewest tokens. This would motivate training to be inference-efficient.

I used to complain that models don't bother to think if a problem is worthy of reasoning, and push the burden to users. We should do better on this.
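
One way to sketch such a metric (the discount formula and the token budget are my own assumptions for illustration, not an established benchmark score):

```python
# Fold token cost into a benchmark: a correct answer earns less credit once it
# exceeds a token budget, rewarding models that answer right with fewer tokens.
def cost_aware_score(results: list[dict], token_budget: int = 1_000) -> float:
    """Average correctness, discounted by output-token usage beyond the budget."""
    total = 0.0
    for r in results:
        correctness = 1.0 if r["correct"] else 0.0
        efficiency = min(1.0, token_budget / max(r["output_tokens"], 1))
        total += correctness * efficiency
    return total / len(results)

results = [
    {"correct": True, "output_tokens": 400},    # right and concise -> full credit
    {"correct": True, "output_tokens": 8_000},  # right but verbose -> discounted
    {"correct": False, "output_tokens": 300},   # wrong -> no credit
]
print(f"cost-aware score: {cost_aware_score(results):.2f}")
```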

posted an update 8 days ago
I just compared CPU vs GPU. CPU is actually good for tasks with short prompts and long answers. For such tasks, we usually treat the LLM as a consultant or teacher.

Say you are filing taxes and ask "what is form XXXX?" The chatbot will return an essay to explain the form and walk you through scenarios.

But when you decide to file this form, the LLM becomes your assistant/agent. Suddenly the prompt becomes (much) longer than the answer. You throw in a bunch of documents and ask the LLM to fill out the form for you.

This is when we need a GPU. I will get into the details in the next post.
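
A tiny sketch of the two workload shapes; the token counts are illustrative guesses, not measurements:

```python
# Consultant vs. agent: what flips is the prompt/answer length ratio.
def workload_shape(prompt_tokens: int, output_tokens: int) -> str:
    """Label a request as prefill-heavy (agent-like) or decode-heavy (consultant-like)."""
    return "prefill-heavy: wants a GPU" if prompt_tokens > output_tokens else "decode-heavy: a CPU can cope"

print(workload_shape(prompt_tokens=20, output_tokens=800))      # "what is form XXXX?" -> essay back
print(workload_shape(prompt_tokens=12_000, output_tokens=300))  # documents in -> filled form out
```
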
posted an update 9 days ago
We desperately need GPUs for model inference. CPUs can't replace GPUs.

I will start with the basics. A GPU is designed to serve predictable workloads with many parallel units (pixels, tensors, tokens). So a GPU allocates as much of its transistor budget as possible to build thousands of compute units (CUDA cores on NVIDIA, execution units on Apple Silicon), each capable of running a thread.

But a CPU is designed to handle all kinds of workloads. CPU cores are much larger (hence far fewer), with branch prediction and other complex machinery. In addition, more and more transistors are allocated to larger caches (~50% now) to house the unpredictable, devouring the compute budget.

Generalists can't beat specialists.
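
A back-of-the-envelope comparison of the two designs on an embarrassingly parallel workload (core counts and per-core rates are rough assumptions, not real hardware specs):

```python
# Many slow cores beat a few fast cores once the work parallelizes (e.g. prompt prefill).
def aggregate_throughput(cores: int, per_core_rate: float) -> float:
    """Peak units of work per second when every core can be kept busy."""
    return cores * per_core_rate

cpu = aggregate_throughput(cores=32, per_core_rate=100)      # few big cores, each fast
gpu = aggregate_throughput(cores=10_000, per_core_rate=10)   # thousands of small cores, each slower
print(f"CPU: {cpu:,.0f} units/s vs GPU: {gpu:,.0f} units/s")
```
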
posted an update 11 days ago
posted an update 12 days ago
reacted to clem's post with 🔥 15 days ago
Llama models (arguably the most successful open AI models of all time) just represented 3% of total model downloads on Hugging Face in March.

People and media like stories of winner-takes-all & one model/company to rule them all, but the reality is much more nuanced than this!

Kudos to all the small AI builders out there!
posted an update 15 days ago