Supercool Weekend Read
Nvidia researchers achieved SOTA results in LLM compression using pruning and knowledge distillation techniques.
Details on Techniques (Simplified): They started off with a large pre-trained language model (15B params), then:
1. Estimated the importance of different parts of the model (neurons, attention heads, layers) using activation-based metrics on a small calibration dataset (see the first sketch after this list).
2. Pruned (removed) the less important parts of the model to reduce its size (second sketch below).
3. Retrained the pruned model using knowledge distillation, where the original large model acts as a teacher for the smaller pruned model (third sketch below).
4. Used a lightweight neural architecture search to find the best configuration for the pruned model (final sketch below, together with step 5).
5. Repeated this process iteratively, using each compressed model as the starting point for the next round, to create even smaller models.
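
Here's roughly what step 1 could look like in PyTorch. This is my own minimal sketch, not the paper's code: it scores each output neuron of one linear layer by its mean activation magnitude over a calibration set, via a forward hook. `model`, `layer`, and `calib_batches` are stand-in names.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def activation_importance(model: nn.Module, layer: nn.Linear, calib_batches) -> torch.Tensor:
    """Score each output neuron of `layer` by its mean |activation| over a
    small calibration set; low-scoring neurons become pruning candidates."""
    scores = torch.zeros(layer.out_features, device=layer.weight.device)
    token_count = 0

    def hook(_module, _inputs, output):
        nonlocal token_count
        # Collapse batch/sequence dims, keep the per-neuron (last) dim.
        scores.add_(output.abs().sum(dim=tuple(range(output.dim() - 1))))
        token_count += output.numel() // output.shape[-1]

    handle = layer.register_forward_hook(hook)
    for batch in calib_batches:
        model(batch)  # forward passes only; the hook does the bookkeeping
    handle.remove()
    return scores / max(token_count, 1)
```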
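For step 2, pruning a width dimension boils down to slicing weight matrices. A sketch under the same assumptions, for a plain two-projection MLP (gated MLP variants would need the gate projection sliced with the same indices):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_mlp(up: nn.Linear, down: nn.Linear, scores: torch.Tensor, keep: int):
    """Return new up/down projections keeping only the `keep` highest-scoring
    hidden neurons, using the importance scores from step 1."""
    idx = scores.topk(keep).indices.sort().values  # preserve original neuron order
    new_up = nn.Linear(up.in_features, keep, bias=up.bias is not None)
    new_up.weight.copy_(up.weight[idx])            # slice rows (output neurons)
    if up.bias is not None:
        new_up.bias.copy_(up.bias[idx])
    new_down = nn.Linear(keep, down.out_features, bias=down.bias is not None)
    new_down.weight.copy_(down.weight[:, idx])     # slice columns (input neurons)
    if down.bias is not None:
        new_down.bias.copy_(down.bias)
    return new_up, new_down
```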
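Step 3 is classic logit distillation: the frozen original model is the teacher, the pruned model is the student, and the student is trained to match the teacher's softened token distribution. A standard formulation (assumed here, not necessarily the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

# One training step (teacher frozen, student updated):
#   with torch.no_grad():
#       t_logits = teacher(input_ids).logits
#   loss = distillation_loss(student(input_ids).logits, t_logits)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```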
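Finally, steps 4 and 5: "lightweight" search here can be as simple as enumerating candidate depth/width combinations under a parameter budget and scoring each with a short prune-and-distill run, then repeating the whole pipeline from the winner to go even smaller. The grid, the parameter estimate, and `score_fn` below are toy assumptions, not the paper's search space:

```python
import itertools

def estimate_params(layers: int, hidden: int, vocab: int = 32000) -> int:
    # Rough transformer count: ~12*hidden^2 per layer (attention + MLP), plus embeddings.
    return layers * 12 * hidden * hidden + vocab * hidden

def search(budget: int, score_fn):
    """`score_fn(layers, hidden, heads)` is a stand-in for a brief prune +
    distill run returning calibration loss (lower is better)."""
    best, best_loss = None, float("inf")
    for layers, hidden, heads in itertools.product(
        (24, 28, 32), (2048, 3072, 4096), (16, 32)
    ):
        if hidden % heads or estimate_params(layers, hidden) > budget:
            continue
        loss = score_fn(layers, hidden, heads)
        if loss < best_loss:
            best, best_loss = (layers, hidden, heads), loss
    return best
```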