Yi Cui

onekq

AI & ML interests

Benchmark, Code Generation Model

Recent Activity

Organizations

MLX Community, ONEKQ AI

onekq's activity

posted an update about 13 hours ago
updated a Space about 15 hours ago
replied to their post 1 day ago
posted an update 1 day ago
reacted to JLouisBiz's post with 👀 1 day ago
https://www.youtube.com/watch?v=84iS3atFQdI

**Speech typing in Emacs** using the NVIDIA Canary 1B model in multiple languages

This video showcases a demonstration of speech-to-text capabilities within the popular text editor, Emacs, utilizing the advanced NVIDIA Canary 1 Billion parameter (1B) language model. The presentation highlights how users can effectively type and edit documents across various programming or markup languages using spoken commands.

The demo likely illustrates seamless integration between cutting-edge AI technology from NVIDIA's Canary series (known for its powerful natural language processing capabilities) and Emacs, a highly customizable text editor favored by developers worldwide. By leveraging the 1B model, which is capable of understanding context and nuances in multiple human languages, users can dictate their code or prose directly into Emacs with impressive accuracy.

The video probably covers how this setup supports several different programming languages as well as natural language typing tasks, showcasing its versatility across various domains such as software development and content creation. Additionally, the demonstration may include examples of real-time transcription performance in diverse linguistic contexts to emphasize the model's multilingual proficiency.

Overall, viewers can expect insights into enhancing productivity by integrating AI-driven speech recognition directly within their text editing workflow using Emacs paired with NVIDIA's advanced language models.
reacted to ZennyKenny's post with 🤗 1 day ago
Submitted my first dataset for the Reasoning Datasets Competition! ZennyKenny/TRON-dataset-v.1.0

This dataset is designed to post-train Metareasoning agents, i.e. agents whose job is to quickly (and, importantly, cheaply) decide whether it makes sense to launch a full reasoning job or simply use a completions job.
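
A minimal sketch of that routing idea, assuming a hypothetical keyword-and-length heuristic (an illustration only, not the dataset's actual labeling logic):

```python
# Hypothetical metareasoning router: cheaply decide whether a query warrants a
# full reasoning run or a plain completion. The heuristic below is a stand-in.
REASONING_HINTS = ("prove", "step by step", "optimize", "debug", "how many")

def route(query: str) -> str:
    """Return which backend should handle this query."""
    needs_reasoning = len(query.split()) > 40 or any(h in query.lower() for h in REASONING_HINTS)
    return "reasoning-model" if needs_reasoning else "completions-model"

print(route("What is the capital of France?"))                   # -> completions-model
print(route("Prove that the sum of two even numbers is even."))  # -> reasoning-model
```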

There's still plenty of time to join the competition! https://www.bespokelabs.ai/blog/reasoning-datasets-competition

Generation notebook (linked in dataset) is open source and pretty well generalized, if I do say so myself, so you can use it to make your own Metareasoning datasets.

Shoutout to @onekq for his inspiring comment on this topic.
posted an update 3 days ago
posted an update 4 days ago
replied to their post 4 days ago

Agreed on NPUs, looking forward to seeing more of them hitting the fab.

posted an update 5 days ago
I used three posts to explain GPU/CPU and LLM performance; now I finally circle back to my own model. 😅

OneSQL needs a GPU because it processes long prompts. It is not a chatbot that replies to short prompts with long answers. I call models of my kind workhorse models.

We all have to scramble for GPUs to get adoption. Below are a few ways.

You can inherit it. If you have a new Mac, congratulations, you have a GPU.

You can leverage it. Get inference providers to adopt your model, then you switch from CapEx to OpEx.

Or you buy it. Go frugal: find older GPUs with enough HBM to house your model.
posted an update 6 days ago
I just compared tasks with different input/output lengths. CPU and GPU performance are very different here.

The LLMs we use today are autoregressive or causal inference models, meaning the generation of each output token depends on all previous tokens. Since the model must generate one token at a time, it sets a hard limit on parallelism. The chatbot simulating human typing is in fact a UI trick to gloss over this fundamental limit. This is great news for CPUs because it levels the playing field.

But when processing input tokens, this limit doesn't exist. The GPU can fire up thousands of cores (vs dozens of CPU cores) to process as many input tokens as it can, all at once. Here, GPU enjoys a significant speed margin over CPU. The longer the prompt, the bigger the margin.

So, when it comes to user experience, both GPU and CPU can output text at decent speed. What really distinguishes them is the initial wait time, i.e. prompt processing delay.
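
A back-of-the-envelope sketch of that wait-time difference; the throughput numbers are made-up assumptions for illustration, not measurements:

```python
# Toy latency model: prefill is parallel (all prompt tokens at once),
# decode is sequential (one output token at a time).
def latency(prompt_tokens: int, output_tokens: int,
            prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (time to first token, total time) in seconds."""
    prefill = prompt_tokens / prefill_tps   # parallel compute -> GPUs shine here
    decode = output_tokens / decode_tps     # serial bottleneck -> CPUs stay in the game
    return prefill, prefill + decode

# Hypothetical throughputs: the GPU prefills ~50x faster, but decode speeds are much closer.
for name, prefill_tps, decode_tps in [("CPU", 200, 15), ("GPU", 10_000, 60)]:
    ttft, total = latency(prompt_tokens=4_000, output_tokens=200,
                          prefill_tps=prefill_tps, decode_tps=decode_tps)
    print(f"{name}: first token after {ttft:.1f}s, finished after {total:.1f}s")
```
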
replied to ZennyKenny's post 7 days ago

Benchmarks nowadays focus on accuracy. It would be great if we could also factor in token cost, i.e. delivering the right answer with the fewest tokens. This would motivate training to be inference-efficient.

I used to complain that models don't bother to think if a problem is worthy of reasoning, and push the burden to users. We should do better on this.
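
One way to sketch such a metric (the discount formula and the token budget are my own assumptions for illustration, not an established benchmark score):

```python
# Fold token cost into a benchmark: a correct answer earns less credit once it
# exceeds a token budget, rewarding models that answer right with fewer tokens.
def cost_aware_score(results: list[dict], token_budget: int = 1_000) -> float:
    """Average correctness, discounted by output-token usage beyond the budget."""
    total = 0.0
    for r in results:
        correctness = 1.0 if r["correct"] else 0.0
        efficiency = min(1.0, token_budget / max(r["output_tokens"], 1))
        total += correctness * efficiency
    return total / len(results)

results = [
    {"correct": True, "output_tokens": 400},    # right and concise -> full credit
    {"correct": True, "output_tokens": 8_000},  # right but verbose -> discounted
    {"correct": False, "output_tokens": 300},   # wrong -> no credit
]
print(f"cost-aware score: {cost_aware_score(results):.2f}")
```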

posted an update 8 days ago
I just compared CPU vs GPU. CPU is actually good for tasks with short prompts and long answers. For such tasks, we usually treat the LLM as a consultant or teacher.

Say you are filing taxes and ask "what is form XXXX?" The chatbot will return an essay to explain the form and walk you through scenarios.

But when you decide to file this form, the LLM becomes your assistant/agent. Suddenly the prompt becomes (much) longer than the answer. You throw in a bunch of documents and ask the LLM to fill out the form for you.

This is when we need a GPU. I will get into the details in the next post.
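
A tiny sketch of the two workload shapes; the token counts are illustrative guesses, not measurements:

```python
# Consultant vs. agent: what flips is the prompt/answer length ratio.
def workload_shape(prompt_tokens: int, output_tokens: int) -> str:
    """Label a request as prefill-heavy (agent-like) or decode-heavy (consultant-like)."""
    return "prefill-heavy: wants a GPU" if prompt_tokens > output_tokens else "decode-heavy: a CPU can cope"

print(workload_shape(prompt_tokens=20, output_tokens=800))      # "what is form XXXX?" -> essay back
print(workload_shape(prompt_tokens=12_000, output_tokens=300))  # documents in -> filled form out
```
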
posted an update 9 days ago
We desperately need GPUs for model inference. CPUs can't replace GPUs.

I will start with the basics. A GPU is designed to serve predictable workloads with many parallel units (pixels, tensors, tokens). So a GPU allocates as much of its transistor budget as possible to build thousands of compute units (CUDA cores on NVIDIA, execution units on Apple Silicon), each capable of running a thread.

But a CPU is designed to handle all kinds of workloads. CPU cores are much larger (hence far fewer), with branch prediction and other complex machinery. In addition, more and more transistors are allocated to larger caches (~50% now) to house the unpredictable, devouring the compute budget.

Generalists can't beat specialists.
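
A back-of-the-envelope comparison of the two designs on an embarrassingly parallel workload (core counts and per-core rates are rough assumptions, not real hardware specs):

```python
# Many slow cores beat a few fast cores once the work parallelizes (e.g. prompt prefill).
def aggregate_throughput(cores: int, per_core_rate: float) -> float:
    """Peak units of work per second when every core can be kept busy."""
    return cores * per_core_rate

cpu = aggregate_throughput(cores=32, per_core_rate=100)      # few big cores, each fast
gpu = aggregate_throughput(cores=10_000, per_core_rate=10)   # thousands of small cores, each slower
print(f"CPU: {cpu:,.0f} units/s vs GPU: {gpu:,.0f} units/s")
```
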
posted an update 11 days ago
posted an update 12 days ago
reacted to clem's post with 🔥 15 days ago
Llama models (arguably the most successful open AI models of all time) just represented 3% of total model downloads on Hugging Face in March.

People and media like stories of winner-takes-all & one model/company to rule them all, but the reality is much more nuanced than this!

Kudos to all the small AI builders out there!
posted an update 15 days ago