Given the recent findings on bf16 accuracy issues, we need fp32
Recent reports found that PyTorch (v2.9.1+cu130) is using JIT translation to run Hopper PTX instructions on Blackwell hardware.
ISA Mismatch: The Instruction Set Architecture (ISA) for tensor cores changed between Hopper and consumer Blackwell, and the forward-compatibility layer is mistranslating these instructions. Instead of throwing an error or crashing, the hardware performs "silently wrong math," which is particularly dangerous for machine learning model training and inference.
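Not claiming this proves the bug, but here's a minimal sanity check anyone can run on their own card: do a bf16 matmul on the GPU and compare it against an fp32 reference on the CPU. The tolerance below is my own illustrative assumption, not an official threshold.

```python
import torch

torch.manual_seed(0)
a = torch.randn(1024, 1024)  # fp32 on CPU
b = torch.randn(1024, 1024)

ref = a @ b  # fp32 reference result, computed on CPU

# Same matmul in bf16 on the GPU's tensor cores, then back to fp32
out = (a.bfloat16().cuda() @ b.bfloat16().cuda()).float().cpu()

rel_err = ((out - ref).abs().max() / ref.abs().max()).item()
print(f"max relative error: {rel_err:.2e}")

# bf16 keeps only ~8 significant bits, so relative error around 1e-3
# to 1e-2 is normal rounding; errors orders of magnitude beyond that
# would point at mistranslated tensor-core instructions rather than
# ordinary precision loss. The 5e-2 cutoff is an assumed tolerance.
assert rel_err < 5e-2, "bf16 matmul is off by far more than rounding"
```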
This may never be fixed, but sharing a dev fp32 version shouldn't be difficult, and it would give people the best shot at tuning and training.
Do you have a reference to the issue/bug? Was this issue fixed in newer PyTorch versions?
Been trying to find that article on arXiv. It may have been patched, but it's maybe the seventh major indictment of bf16 I've seen. I always point to this one, https://arxiv.org/html/2510.26788v1#S3, specifically because they directly isolate how bf16 simply has inferior training accuracy, and that's widely accepted at this point. If we're training, we need the best proven accuracy. The whole Zimage scene is dominated by wasted and failed training/tuning runs, and bf16 is particularly crippling for that model. It works fine as a quant for inference, but my point is about tuning access; casting to fp16 may even be a better option if the full fp32 stays dev in-house only.
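To make the fp16/fp32 point concrete, here's a rough sketch of the fallback if the full fp32 never gets released: upcast the published bf16 weights before tuning. This stops further bf16 rounding during optimizer updates, but it can't recover precision that was already thrown away in the bf16 export. File names here are placeholders, not the actual release files.

```python
import torch
from safetensors.torch import load_file, save_file

# Load the released bf16 checkpoint (placeholder file name)
state = load_file("checkpoint.safetensors")

# Upcast to fp32 for tuning: no further rounding in weight updates
state_fp32 = {k: v.to(torch.float32) for k, v in state.items()}
save_file(state_fp32, "checkpoint_fp32.safetensors")

# fp16 as a middle ground: more mantissa bits than bf16 (10 vs 7),
# but a narrower exponent range, so watch for overflow in activations
state_fp16 = {k: v.to(torch.float16) for k, v in state.items()}
save_file(state_fp16, "checkpoint_fp16.safetensors")
```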