Supercool Weekend Read🤖
Nvidia researchers achieved SOTA LLM compression metrics using pruning and knowledge distillation techniques.
Details on Techniques (Simplified):
They started off with a large pre-trained language model (15B params), then:
1. Estimated the importance of different parts of the model (neurons, attention heads, layers) using activation-based metrics on a small calibration dataset.
2. Pruned (removed) the less important parts of the model to reduce its size (rough pruning sketch after this list).
3. Retrained the pruned model using knowledge distillation, where the original large model acts as a teacher for the smaller pruned model (KD loss sketch below).
4. Used a lightweight neural architecture search to find the best configuration for the pruned model.
5. Repeated this process iteratively to create even smaller models.
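If you want a feel for steps 1-2, here's a rough PyTorch sketch of activation-based importance scoring and width pruning on a toy MLP block. The names (MLPBlock, neuron_importance, prune_mlp) and the toy architecture are my own for illustration, not the actual Minitron code; it just shows the general idea of scoring neurons on a calibration set and keeping the top-k:

```python
# Toy sketch: activation-based neuron importance + width pruning.
# NOT the Minitron implementation -- just the general recipe.
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

@torch.no_grad()
def neuron_importance(block, calib_batches):
    """Score each hidden neuron by its mean absolute activation
    over a small calibration set (one flavor of the activation-based
    metrics described in the paper)."""
    scores = torch.zeros(block.up.out_features)
    for x in calib_batches:
        acts = torch.relu(block.up(x))        # (batch, seq, d_ff)
        scores += acts.abs().mean(dim=(0, 1))
    return scores / len(calib_batches)

@torch.no_grad()
def prune_mlp(block, keep_ratio, calib_batches):
    """Keep only the top-k neurons and rebuild smaller Linear layers."""
    scores = neuron_importance(block, calib_batches)
    k = int(block.up.out_features * keep_ratio)
    keep = scores.topk(k).indices.sort().values
    pruned = MLPBlock(block.up.in_features, k)
    pruned.up.weight.copy_(block.up.weight[keep])
    pruned.up.bias.copy_(block.up.bias[keep])
    pruned.down.weight.copy_(block.down.weight[:, keep])
    pruned.down.bias.copy_(block.down.bias)
    return pruned

# Example: prune half the hidden neurons using 8 random calibration batches.
calib = [torch.randn(4, 128, 512) for _ in range(8)]
small = prune_mlp(MLPBlock(), keep_ratio=0.5, calib_batches=calib)
```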
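And a minimal sketch of the distillation objective from step 3, assuming a plain KL-divergence loss on teacher vs. student logits (the paper's full retraining recipe goes beyond this; kd_loss and temperature are illustrative names):

```python
# Minimal knowledge-distillation loss sketch: match the pruned student's
# next-token distribution to the frozen teacher's softened distribution.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# During retraining: run the same batch through the frozen 15B teacher and
# the pruned student, then minimize kd_loss on their output logits
# (optionally mixed with the usual cross-entropy on ground-truth tokens).
```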
Cool, giving it a try this weekend 😎
Code: https://github.com/NVlabs/Minitron
Paper: https://arxiv.org/abs/2407.14679
Demo: nvidia/minitron