ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Paper: arXiv:2403.03853
Note

1. For each layer $i$, compute the cosine similarity (normalized dot-product) between the input hidden-state vectors $X_{i,t}$ and the corresponding output hidden-state vectors $X_{i+1,t}$. Block Influence (BI) is defined as $\mathrm{BI}_i = 1 - \mathbb{E}_{X,t}\!\left[\frac{X_{i,t}^{\top} X_{i+1,t}}{\lVert X_{i,t}\rVert_2 \, \lVert X_{i+1,t}\rVert_2}\right]$. If a layer's input and output are nearly identical, the layer performed little transformation, so its BI is low.
2. Run a calibration set through the model to "profile" it, averaging the layerwise BI over this evaluation set (see the sketch after this list).
3. Prune the lowest-BI blocks first.
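A minimal sketch of this profile-then-prune loop, assuming a Llama-style Hugging Face checkpoint whose decoder stack lives at `model.model.layers`; the model name, calibration texts, and pruning count `k` below are placeholders, not values from the paper:

```python
# Sketch of layerwise BI profiling + pruning for a Llama-style causal LM.
# Assumptions: decoder blocks at model.model.layers; placeholder model name
# and calibration texts; not the paper's exact implementation.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any Llama-style model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
model.eval()

@torch.no_grad()
def layerwise_bi(texts):
    """BI_i = 1 - E_t[cos(X_{i,t}, X_{i+1,t})], averaged over a calibration set."""
    n_layers = model.config.num_hidden_layers
    bi = torch.zeros(n_layers)
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        # hidden_states is a tuple of length n_layers + 1: entry i is the
        # input to decoder layer i, entry i + 1 is that layer's output.
        hs = model(**inputs, output_hidden_states=True).hidden_states
        for i in range(n_layers):
            cos = F.cosine_similarity(hs[i], hs[i + 1], dim=-1)  # (1, seq_len)
            bi[i] += (1.0 - cos).float().mean()
    return bi / len(texts)

calibration_texts = ["The quick brown fox jumps over the lazy dog."]  # placeholder
bi = layerwise_bi(calibration_texts)

# Remove the k lowest-BI (most redundant) blocks, highest index first so
# that the remaining indices stay valid while deleting.
k = 4
for i in sorted(torch.argsort(bi)[:k].tolist(), reverse=True):
    del model.model.layers[i]
model.config.num_hidden_layers = len(model.model.layers)
```

One caveat: in recent `transformers` versions each decoder layer stores its own `layer_idx` for KV-cache bookkeeping, so generation with a cache after pruning may also require re-assigning those indices.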