unexpected results

#1
by ShikaiChen - opened

Hi,

I’m having trouble replicating the results using the sample code for the 27B model (bf16). The expected output is:

Output:

27B:
Score for response 1: 0.5625
Score for response 2: -8.5

However, I’m getting:

Score for response 1: -4.5
Score for response 2: -6.03125

Additionally, when the model is loaded in float16, it produces NaN values.

Score for response 1: nan
Score for response 2: nan

I have verified the SHA256 checksums of all files, and the results for the 8B model (bf16) match as expected. Could you please advise whether there's anything else I should check, or verify that the uploaded weights are correct?
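For reference, here is essentially what I’m running: a minimal sketch based on the sample usage from the model card. The repo ID, device, and prompt/response text below are illustrative placeholders rather than the exact values from the card:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Repo ID and device are illustrative placeholders.
model_name = "Skywork/Skywork-Reward-Gemma-2-27B"
device = "cuda:0"

rm = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # switching this to torch.float16 is what gives me the NaN scores
    attn_implementation="flash_attention_2",
    device_map=device,
    num_labels=1,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder conversation; I use the prompt and the two responses from the sample code.
prompt = "..."
response = "..."
conv = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
]

# Tokenize with the chat template and take the single reward logit as the score.
input_ids = tokenizer.apply_chat_template(conv, tokenize=True, return_tensors="pt").to(device)
with torch.no_grad():
    score = rm(input_ids).logits[0][0].item()
print(f"Score for response: {score}")
```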

Thanks!

Skywork org

Hi,

After conducting further checks across multiple machines, we can confirm that our results are correct. To ensure consistency, could you kindly verify that the relevant packages are up to date? Please see the versions we used below:

•	transformers==4.45.2
•	flash-attn==2.6.3
•	torch==2.5.0

Additionally, we recommend loading the model in bfloat16 with flash_attention_2. Our models were trained and evaluated in bfloat16, and we have not yet tested them with regular float16, so we suggest sticking with bfloat16 for optimal performance.
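For example, loading along these lines (the repo ID below is illustrative; substitute the checkpoint you are using):

```python
import torch
import transformers
import flash_attn
from transformers import AutoModelForSequenceClassification

# Quick check that the environment matches the versions listed above.
print(transformers.__version__, flash_attn.__version__, torch.__version__)

# Load in bfloat16 with FlashAttention-2, the configuration the model was
# trained and evaluated with; regular float16 is untested and may produce NaNs.
rm = AutoModelForSequenceClassification.from_pretrained(
    "Skywork/Skywork-Reward-Gemma-2-27B",  # illustrative repo ID
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```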

Please let us know if you encounter any further issues or need additional information.


Hi,
After uninstalling flash-attn 2.5.6 and installing flash-attn 2.6.3, everything works correctly.

Also, I noticed that the 8B model performs slightly better on RewardBench when tested in float16. Thank you for your help.

chrisliu298 changed discussion status to closed
