How was LiveCodeBench evaluated?
I would like to reproduce the LiveCodeBench results for QwQ-32B. Could you please tell me which code repository you used and what configuration was used for the testing?
Please refer to EvalScope: https://github.com/modelscope/evalscope
EvalScope supports LiveCodeBench evaluation for QwQ-32B :)
Thanks for your reply!
Which hyperparameter configuration did you use to evaluate on LiveCodeBench?
The results measured using the official LiveCodeBench configuration are lower than what you reported.
For the specific evaluation steps, please refer to the best practice: https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html#evaluating-code-capability
We conducted the evaluation of QwQ-32B based on the official code implementation of LiveCodeBench.
Indeed, as you mentioned, our results are about 1 point lower than those reported in the QwQ-32B technical report.
We speculate that this may be related to factors such as prompt construction and inference parameter settings.
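If it helps to narrow this down, here is a minimal sketch for comparing the inference settings of two runs side by side. The helper `diff_configs` and every value in the two dictionaries are hypothetical placeholders for illustration. They are not the settings used in the technical report or in our run, so substitute the parameters from your own setup.

```python
def diff_configs(ours: dict, reference: dict) -> dict:
    """Return {key: (ours_value, reference_value)} for every setting that differs."""
    missing = object()  # sentinel so a key present on only one side is reported too
    keys = sorted(set(ours) | set(reference))
    return {
        key: (ours.get(key), reference.get(key))
        for key in keys
        if ours.get(key, missing) != reference.get(key, missing)
    }


# Hypothetical placeholder values, NOT the actual settings of either evaluation.
run_a = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 32768, "n": 1}
run_b = {"temperature": 0.2, "top_p": 0.95, "max_tokens": 16384, "n": 10}

for name, (a, b) in diff_configs(run_a, run_b).items():
    print(f"{name}: run_a={a!r} run_b={b!r}")
```

Any key that shows up in the diff (sampling temperature, token budget, number of samples, and so on) is a candidate explanation for a small score gap. Prompt templates can be compared the same way by storing them as string values.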