Confusion about the model?

#4
by as4art07 - opened

Did you arrive at this model by performing "deep self-attention distillation" using "microsoft/MiniLM-L12-H384-uncased" as a teacher assistant (which was itself derived as a student of UniLMv2, as described in the paper "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers"),
or
by directly removing every second layer from the already trained student model "microsoft/MiniLM-L12-H384-uncased"?

It isn't exactly clear to me. Could you please confirm?
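To make the second option concrete, this is roughly what I mean by "removing every second layer" (just a sketch using the transformers library; the layer-slicing logic is my own illustration, not something the model card confirms):

```python
# Sketch of the "drop every second layer" interpretation; NOT a confirmed
# description of how this checkpoint was produced.
import torch
from transformers import AutoModel

# Assumption: the 12-layer MiniLM checkpoint uses a BERT-style encoder,
# so its transformer blocks live under model.encoder.layer.
model = AutoModel.from_pretrained("microsoft/MiniLM-L12-H384-uncased")

# Keep layers 0, 2, 4, 6, 8, 10, i.e. remove every second layer.
kept = [layer for i, layer in enumerate(model.encoder.layer) if i % 2 == 0]
model.encoder.layer = torch.nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

print(model.config.num_hidden_layers)  # 6
```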

I have the same question. Based on this quote from the paper: "We distill the teacher model into 12-layer and 6-layer models with 384 hidden size using the same corpora. The 12x384 model is used as the teacher assistant to train the 6x384 model." I'm guessing this is the 6x384 model.
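If that reading is right, the 6x384 student would have been trained with the deep self-attention distillation objective (attention-distribution and value-relation transfer from the 12x384 teacher assistant's last layer) rather than by dropping layers. A rough sketch of that objective as I understand it from the paper; the function name, tensor shapes, and dummy inputs below are my own illustration, not the authors' training code:

```python
# Minimal sketch of MiniLM's deep self-attention distillation loss.
import torch
import torch.nn.functional as F

def minilm_distillation_loss(teacher_attn, student_attn,
                             teacher_values, student_values, eps=1e-9):
    """
    teacher_attn, student_attn:     [batch, heads, seq, seq] attention probabilities
    teacher_values, student_values: [batch, heads, seq, head_dim] value vectors
    """
    # Attention-distribution transfer: KL between last-layer attention maps.
    attn_loss = F.kl_div((student_attn + eps).log(), teacher_attn,
                         reduction="batchmean")

    # Value-relation transfer: KL between scaled value dot-product distributions.
    def value_relation(v):
        scores = v @ v.transpose(-1, -2) / (v.size(-1) ** 0.5)
        return scores.softmax(dim=-1)

    vr_loss = F.kl_div((value_relation(student_values) + eps).log(),
                       value_relation(teacher_values), reduction="batchmean")
    return attn_loss + vr_loss

# Toy usage with random tensors (batch=2, heads=12, seq=8, head_dim=32,
# since 384 hidden / 12 heads = 32 dims per head for both models).
t_attn = torch.rand(2, 12, 8, 8).softmax(dim=-1)
s_attn = torch.rand(2, 12, 8, 8).softmax(dim=-1)
t_val = torch.randn(2, 12, 8, 32)
s_val = torch.randn(2, 12, 8, 32)
print(minilm_distillation_loss(t_attn, s_attn, t_val, s_val))
```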
