🤝 Open to Collab

Rémi Ouazan Reboul

ror

AI & ML interests

None yet

Recent Activity

upvoted a collection 3 days ago

Llama 4

Collection

Llama 4 release • 13 items • Updated Apr 29, 2025 • 739

upvoted a paper 22 days ago

GLM-5: from Vibe Coding to Agentic Engineering

Paper • 2602.15763 • Published Feb 17 • 196

upvoted 2 articles 28 days ago

Article

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

ariG23498, sayakpaul, sergiopaniego, ror, pcuenq • May 29 • 136

Article

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

ariG23498, ror, sergiopaniego, pcuenq, sayakpaul • 28 days ago • 52

published an article 28 days ago

Article

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

ariG23498, ror, sergiopaniego, pcuenq, sayakpaul • 28 days ago • 52

upvoted an article 30 days ago

Article

How an Agent Built a 3D Paris Gallery by Chaining Two Hugging Face Spaces

mishig • 30 days ago • 23

published an article about 1 month ago

Article

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

ariG23498, sayakpaul, sergiopaniego, ror, pcuenq • May 29 • 136

updated a Space about 2 months ago

Transformers Point Cloud

🐝

Explore transformer repo files in a 3D point cloud

published a Space about 2 months ago

Transformers Point Cloud

🐝

Explore transformer repo files in a 3D point cloud

updated a Space about 2 months ago

Transformers Point Cloud

🐝

Explore and zoom into a 3D file map with chat commands

published a Space about 2 months ago

Transformers Point Cloud

🐝

Explore and zoom into a 3D file map with chat commands

commented on Unlocking asynchronicity in continuous batching about 2 months ago

The inputs and outputs are actually static tensors, pre-allocated before any CUDA graph is even created.

As for why the pool is useful:
Say you capture a CUDA graph A: it needs memory to execute, namely to store the activations and workspace tensors of the kernels launched in graph A. This allocated memory is owned by graph A.
Then, if you create another graph B, it will also allocate some memory for its execution. Because CUDA can't be sure you won't run graphs A and B at the same time, they can't be given the same memory: you would risk data corruption. But if you can guarantee that graphs A and B won't run at the same time, there is no reason not to allocate the same memory to both. That memory shared between graphs A and B is the memory pool.

And as a side note, if you know which graph is going to need the most memory, you should capture it first. That way, the graph pool reaches its maximum size right away, and any graph you capture afterwards can always fit inside the pool. Whereas if you capture a graph that requires little memory and then capture one that requires more, you run the risk of memory fragmentation.
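
To make the capture-order argument concrete, here is a minimal toy model in plain Python. It is not the real CUDA allocator: the function name `pool_footprint`, the block-reuse rule, and the sizes are all assumptions made for illustration. Each graph is assumed to need a single contiguous block, and a later graph may reuse any existing block in the shared pool that is large enough; otherwise a new block is added to the pool.

```python
def pool_footprint(graph_sizes):
    """Toy model (hypothetical, not the CUDA allocator).

    Returns the total memory held by a shared pool after capturing
    graphs in the given order. Assumes the graphs never run at the
    same time, so a block already in the pool can be reused by any
    later graph that fits inside it.
    """
    blocks = []  # sizes of the blocks currently held by the pool
    for size in graph_sizes:
        # Reuse an existing block if one is large enough ...
        if not any(block >= size for block in blocks):
            # ... otherwise the pool has to grow by a new block.
            blocks.append(size)
    return sum(blocks)


# Same three graphs (arbitrary units), captured in two different orders.
largest_first = pool_footprint([512, 128, 256])
smallest_first = pool_footprint([128, 256, 512])
print(largest_first, smallest_first)
```

Under these assumptions, capturing the largest graph first keeps the pool at the size of that single graph, while capturing in increasing order leaves smaller blocks behind that the larger graphs cannot use.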

commented on Unlocking asynchronicity in continuous batching about 2 months ago

Thank you!

upvoted an article about 2 months ago

Article

Unlocking asynchronicity in continuous batching

ror, pcuenq, ariG23498 • May 14 • 61

published an article about 2 months ago

Article

Unlocking asynchronicity in continuous batching

ror, pcuenq, ariG23498 • May 14 • 61

updated a dataset about 2 months ago

huggingface/documentation-images

Viewer • Updated about 17 hours ago • 59 • 3M • 163

upvoted an article about 2 months ago

Article

Two Years of Local AI on a Laptop: When Open Models Outpaced Moore's Law

mishig • May 11 • 24

New activity in huggingface/documentation-images 2 months ago

Upload images for the continuous async blog post

#611 opened 2 months ago by ror

reacted to qgallouedec's post with 🔥 3 months ago

Post

2070

TRL v1.2 introduces the SSDTrainer 🚀

Simple Self-Distillation (SSD) from Apple's paper "Embarrassingly Simple Self-Distillation Improves Code Generation" is now available as an experimental trainer in TRL.

The recipe is as minimal as the name suggests: sample completions from the model itself at a training-time temperature, then fine-tune on those raw, unverified samples with plain cross-entropy. No reward model. No verifier. No teacher model. No reinforcement learning. Just prompts and the model.

from trl.experimental.ssd import SSDConfig, SSDTrainer

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(temperature=0.6, top_k=20, top_p=0.95),
    train_dataset=dataset,
)
trainer.train()

v1.2 also ships expanded tool-calling support (LLaMA 3.1 / 3.2, DeepSeek-V3), another round of KTO ↔ DPO alignment getting us closer to promoting KTO to stable, a big GRPO simplification for overlong tool results, deprecation of use_transformers_paged, and key fixes for VLM response parsing.

Full release notes: https://github.com/huggingface/trl/releases/tag/v1.2.0

liked a Space 5 months ago

Can LLMs Play the Game of Science?

📝

Explore LLM science benchmark scores
