One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings Paper • 2503.03008 • Published 10 days ago