Open to Collab
Pedro Cuenca
PRO
pcuenq
1,705 followers · 678 following
pcuenq
pcuenca
pcuenq
pcuenq.hf.co
AI & ML interests
None yet
Recent Activity
Liked a model about 3 hours ago: unsloth/gemma-4-26B-A4B-it-GGUF
Liked a model about 3 hours ago: ggml-org/Qwen3.6-35B-A3B-MTP-GGUF
Reacted to alvarobartt's post about 7 hours ago:
Latest `hf-mem` release added a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.

- `hf-mem` now splits MoE memory into base model weights, routed experts, and KV cache
- Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
- Active params isn't the same as memory footprint, especially for sparse architectures
- Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
- KV cache can still dominate depending on context length, batch size, and concurrency
- Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
- Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving

Check the repository at https://github.com/alvarobartt/hf-mem
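The distinction the post draws between active parameters and resident memory can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not `hf-mem`'s implementation or API. The `MoEConfig` class, its field names, and every number in the example are hypothetical, chosen only to illustrate the three-way split into base weights, routed experts, and KV cache. It also assumes plain attention in every layer (no sliding window or cache quantization).

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical description of an MoE model (not a real model's config)."""
    base_params: int        # shared weights: attention, embeddings, routers
    params_per_expert: int  # weights of one routed expert, summed over all layers
    num_experts: int        # routed experts that must be resident in memory
    experts_per_token: int  # experts actually selected for each token
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_param: int = 2  # assumes 16-bit weights
    bytes_per_kv: int = 2     # assumes a 16-bit KV cache


def resident_weight_bytes(cfg: MoEConfig) -> int:
    """Bytes needed to load the model: base weights plus every expert."""
    total = cfg.base_params + cfg.params_per_expert * cfg.num_experts
    return total * cfg.bytes_per_param


def active_weight_bytes(cfg: MoEConfig) -> int:
    """Bytes of weights touched per token: base weights plus routed experts only."""
    active = cfg.base_params + cfg.params_per_expert * cfg.experts_per_token
    return active * cfg.bytes_per_param


def kv_cache_bytes(cfg: MoEConfig, context_len: int, batch_size: int) -> int:
    """Bytes for keys and values (factor 2) across layers, heads, and tokens."""
    per_token = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_kv
    return per_token * context_len * batch_size


# Made-up example values, for illustration only.
cfg = MoEConfig(
    base_params=2_000_000_000,
    params_per_expert=250_000_000,
    num_experts=64,
    experts_per_token=4,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)

GIB = 1024 ** 3
print(f"resident weights: {resident_weight_bytes(cfg) / GIB:.1f} GiB")
print(f"active weights:   {active_weight_bytes(cfg) / GIB:.1f} GiB")
print(f"KV cache:         {kv_cache_bytes(cfg, 8192, 16) / GIB:.1f} GiB")
```

With these made-up numbers the resident weights are six times larger than the active weights, and the KV cache for 16 concurrent 8192-token sequences is the same order of magnitude as the weights, which is the situation the post warns about when sizing a deployment from active parameters alone.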
pcuenq's buckets (3)
pcuenq/stale-pr: 15 Bytes
pcuenq/stale-1: 15 Bytes
pcuenq/cuda-references: 215 GB