Not distilled!
The model has the same number of parameters as the original Llama 3.1 8B, the same size on disk, and both are fp16 precision. So what exactly was distilled here? It looks like you didn't distill this model at all and just took the base model and trained it with reinforcement learning. Or am I missing something?
Uhm, they SFTed the Llama model on data generated with R1. That's the textbook definition of distillation...
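Concretely, "distilled" here just means something like the following. This is a rough sketch with placeholder model names and toy data, not their actual pipeline: the teacher's generations are assumed to be collected already, and the student is simply fine-tuned on that text with the ordinary causal-LM loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "meta-llama/Llama-3.1-8B"  # student keeps its own architecture and size
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.bfloat16)

# (prompt, teacher answer) pairs, e.g. reasoning traces generated by the larger teacher
prompts = ["Prove that sqrt(2) is irrational."]
teacher_answers = ["<think>...long chain of thought...</think> Final answer: ..."]

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for prompt, answer in zip(prompts, teacher_answers):
    batch = tok(prompt + answer, return_tensors="pt")
    # plain next-token cross-entropy on the teacher's text (prompt tokens included
    # for brevity); no teacher logits or KL term involved
    out = student(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

So the parameter count and dtype stay exactly those of the student; only the weights change.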
Oh, knowledge distillation, I see. Sorry, I thought it was about model compression (from a larger model into a smaller one).
Uhm, that's not a thing. Distillation is the process of taking the large sparse teacher model and finetuning a small dense model on its outputs. Compression is the process of reducing the size of a model without retraining it, by lowering the bits-per-parameter of its weights, which preserves the original architecture and parameter count.
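To make the distinction concrete, here's a toy round-to-nearest int8 example (just the idea of "fewer bits per weight, no retraining"; production quantizers like GPTQ/AWQ are considerably smarter than this):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Naive per-tensor round-to-nearest quantization: fp16 -> int8 plus one scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)  # one fp16 weight matrix (~32 MB)
q, s = quantize_int8(w)                           # int8 storage (~16 MB) + one scale
w_hat = dequantize(q, s)                          # same shape, same parameter count
print((w - w_hat).abs().max())                    # quantization error, no retraining
```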
FWIW, the txt2img community uses "distillation" to mean a bigger model distilled into either a smaller one parameter-wise or a shorter one noise-schedule-wise, gods be damned how. I think that layer of not actually knowing what distillation is is where the whole "distilled models can't be finetuned / are completely different / etc." misconception comes from.