Not distilled!
The model has the same number of parameters as the original Llama 3.1 8B, the same size on disk, and both are fp16 precision. So what exactly was distilled here? It looks like you didn't distill this model at all and just took the base model and trained it with reinforcement learning. Or am I missing something?
Uhm, they SFTed the Llama model on data generated with R1. That's the textbook definition of distillation...
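Concretely, "distilled" here just means something like the following. This is a rough sketch with placeholder model names and toy data, not their actual pipeline: the teacher's generations are assumed to be collected already, and the student is simply fine-tuned on that text with the ordinary causal-LM loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "meta-llama/Llama-3.1-8B"  # student keeps its own architecture and size
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.bfloat16)

# (prompt, teacher answer) pairs, e.g. reasoning traces generated by the larger teacher
prompts = ["Prove that sqrt(2) is irrational."]
teacher_answers = ["<think>...long chain of thought...</think> Final answer: ..."]

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for prompt, answer in zip(prompts, teacher_answers):
    batch = tok(prompt + answer, return_tensors="pt")
    # plain next-token cross-entropy on the teacher's text (prompt tokens included
    # for brevity); no teacher logits or KL term involved
    out = student(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

So the parameter count and dtype stay exactly those of the student; only the weights change.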
Oh, knowledge distillation, I see. Sorry, I thought it was about model compression (from a larger model into a smaller one).
Uhm, that's not a thing. Distillation is the process of taking the large sparse teacher model and finetuning a small dense model on its outputs. Compression is the process of reducing the size of a model without retraining it, by lowering the bits-per-parameter of its weights, which preserves the original architecture and parameter count.
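To make the distinction concrete, here's a toy round-to-nearest int8 example (just the idea of "fewer bits per weight, no retraining"; production quantizers like GPTQ/AWQ are considerably smarter than this):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Naive per-tensor round-to-nearest quantization: fp16 -> int8 plus one scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)  # one fp16 weight matrix (~32 MB)
q, s = quantize_int8(w)                           # int8 storage (~16 MB) + one scale
w_hat = dequantize(q, s)                          # same shape, same parameter count
print((w - w_hat).abs().max())                    # quantization error, no retraining
```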
FWIW, the txt2img community uses "distillation" to mean a bigger model distilled into either a smaller one parameter-wise or a shorter one noise-schedule-wise, gods be damned how. I think that layer of not actually knowing what distillation is is where the whole "distilled models can't be finetuned / are completely different / etc." misconception comes from.