Supercool Weekend Read
Nvidia researchers achieved SOTA results in LLM compression using pruning and knowledge distillation techniques.
Details on Techniques (Simplified): They started off with a large pre-trained language model (15B params), then:
1. Estimated the importance of different parts of the model (neurons, attention heads, layers) using activation-based metrics on a small calibration dataset (see the first sketch after this list).
2. Pruned (removed) the less important parts of the model to reduce its size (second sketch below).
3. Retrained the pruned model using knowledge distillation, where the original large model acts as a teacher for the smaller pruned model (third sketch below).
4. Used a lightweight neural architecture search to find the best configuration for the pruned model (final sketch below, together with step 5).
5. Repeated this process iteratively, using each compressed model as the starting point for the next round, to create even smaller models.
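
Here's roughly what step 1 could look like in PyTorch. This is my own minimal sketch, not the paper's code: it scores each output neuron of one linear layer by its mean activation magnitude over a calibration set, via a forward hook. `model`, `layer`, and `calib_batches` are stand-in names.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def activation_importance(model: nn.Module, layer: nn.Linear, calib_batches) -> torch.Tensor:
    """Score each output neuron of `layer` by its mean |activation| over a
    small calibration set; low-scoring neurons become pruning candidates."""
    scores = torch.zeros(layer.out_features, device=layer.weight.device)
    token_count = 0

    def hook(_module, _inputs, output):
        nonlocal token_count
        # Collapse batch/sequence dims, keep the per-neuron (last) dim.
        scores.add_(output.abs().sum(dim=tuple(range(output.dim() - 1))))
        token_count += output.numel() // output.shape[-1]

    handle = layer.register_forward_hook(hook)
    for batch in calib_batches:
        model(batch)  # forward passes only; the hook does the bookkeeping
    handle.remove()
    return scores / max(token_count, 1)
```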
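For step 2, pruning a width dimension boils down to slicing weight matrices. A sketch under the same assumptions, for a plain two-projection MLP (gated MLP variants would need the gate projection sliced with the same indices):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_mlp(up: nn.Linear, down: nn.Linear, scores: torch.Tensor, keep: int):
    """Return new up/down projections keeping only the `keep` highest-scoring
    hidden neurons, using the importance scores from step 1."""
    idx = scores.topk(keep).indices.sort().values  # preserve original neuron order
    new_up = nn.Linear(up.in_features, keep, bias=up.bias is not None)
    new_up.weight.copy_(up.weight[idx])            # slice rows (output neurons)
    if up.bias is not None:
        new_up.bias.copy_(up.bias[idx])
    new_down = nn.Linear(keep, down.out_features, bias=down.bias is not None)
    new_down.weight.copy_(down.weight[:, idx])     # slice columns (input neurons)
    if down.bias is not None:
        new_down.bias.copy_(down.bias)
    return new_up, new_down
```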
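Step 3 is classic logit distillation: the frozen original model is the teacher, the pruned model is the student, and the student is trained to match the teacher's softened token distribution. A standard formulation (assumed here, not necessarily the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

# One training step (teacher frozen, student updated):
#   with torch.no_grad():
#       t_logits = teacher(input_ids).logits
#   loss = distillation_loss(student(input_ids).logits, t_logits)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```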
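Finally, steps 4 and 5: "lightweight" search here can be as simple as enumerating candidate depth/width combinations under a parameter budget and scoring each with a short prune-and-distill run, then repeating the whole pipeline from the winner to go even smaller. The grid, the parameter estimate, and `score_fn` below are toy assumptions, not the paper's search space:

```python
import itertools

def estimate_params(layers: int, hidden: int, vocab: int = 32000) -> int:
    # Rough transformer count: ~12*hidden^2 per layer (attention + MLP), plus embeddings.
    return layers * 12 * hidden * hidden + vocab * hidden

def search(budget: int, score_fn):
    """`score_fn(layers, hidden, heads)` is a stand-in for a brief prune +
    distill run returning calibration loss (lower is better)."""
    best, best_loss = None, float("inf")
    for layers, hidden, heads in itertools.product(
        (24, 28, 32), (2048, 3072, 4096), (16, 32)
    ):
        if hidden % heads or estimate_params(layers, hidden) > budget:
            continue
        loss = score_fn(layers, hidden, heads)
        if loss < best_loss:
            best, best_loss = (layers, hidden, heads), loss
    return best
```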