685B? What are the extra parameters compared to 671B?
Does anybody know what the extra params are?
Commenting to follow
If you read the technical paper, you will know: https://arxiv.org/html/2412.19437v1
Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which parallelly predicts D additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.
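To make the "sequential" part concrete, here is a minimal sketch of one MTP depth. This is not DeepSeek's code: MTPBlock, d_model, and the single TransformerEncoderLayer per depth are illustrative assumptions. What does follow the paper is sharing the embedding and output head with the main model, and combining the previous depth's (normalized) hidden state with the (normalized) embedding of the next future token before predicting one token further ahead.

```python
import torch
import torch.nn as nn

class MTPBlock(nn.Module):
    """One prediction depth of a sequential multi-token-prediction module (toy sketch)."""

    def __init__(self, d_model, shared_embed, shared_head, nhead=8):
        super().__init__()
        self.embed = shared_embed                    # embedding shared with the main model
        self.head = shared_head                      # output head shared with the main model
        self.norm_h = nn.RMSNorm(d_model)            # RMSNorm as in the paper (needs PyTorch >= 2.4)
        self.norm_e = nn.RMSNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model)  # merge prev-depth state with token embedding
        self.block = nn.TransformerEncoderLayer(d_model, nhead=nhead, batch_first=True)

    def forward(self, h_prev, future_tokens):
        # h_prev:        [B, T, d_model] hidden states from the previous depth
        #                (the main model's final hidden states at depth 0)
        # future_tokens: [B, T] ground-truth tokens one step further ahead (teacher forcing)
        e = self.embed(future_tokens)
        h = self.proj(torch.cat([self.norm_h(h_prev), self.norm_e(e)], dim=-1))
        T = h.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf"), device=h.device), diagonal=1)
        h = self.block(h, src_mask=causal_mask)      # the causal chain is kept at every depth
        logits = self.head(h)                        # predicts the token one more step ahead
        return h, logits
```

Chaining two such blocks, with depth 2 consuming depth 1's hidden states, is what makes the prediction sequential rather than the independent parallel heads of Gloeckle et al. (2024).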
DeepSeek-V3-0324 has an extra MTP output head of about 14B parameters, which accounts for the 685B vs. 671B difference. The head was used during training for the MTP objective, and it can also act as a draft model to further speed up inference if the inference code supports it.
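The "draft model" use amounts to speculative decoding: the cheap head proposes a few tokens, and the full model verifies them in a single forward pass. Below is a minimal greedy sketch, where target and draft are hypothetical callables that map token ids to per-position logits; this is not DeepSeek's inference code or any particular framework's API.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, tokens, k=4):
    """One greedy speculative-decoding step.

    target, draft: callables mapping token ids [1, T] -> logits [1, T, V]
    tokens:        current sequence, shape [1, T]
    """
    prompt_len = tokens.size(1)
    # 1) the cheap draft proposes k tokens autoregressively
    proposal = tokens
    for _ in range(k):
        nxt = draft(proposal)[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)
    # 2) the full model scores every drafted position in one forward pass
    logits = target(proposal)
    verified = logits[:, prompt_len - 1:-1].argmax(-1)  # target's greedy choice at each drafted slot
    drafted = proposal[:, prompt_len:]
    # 3) accept the longest agreeing prefix, then take the target's token at the
    #    first disagreement (the bonus token when everything agrees is omitted here)
    n_ok = int((verified == drafted).long().cumprod(-1).sum())
    accepted = torch.cat([drafted[:, :n_ok], verified[:, n_ok:n_ok + 1]], dim=-1)
    return torch.cat([tokens, accepted], dim=-1)
```

Real implementations sample rather than take the argmax and keep a bonus token when all drafted tokens are accepted, but the accept-a-verified-prefix idea is the same.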