GemmaSdpaAttention vs GemmaAttention

#71
by canqin001 - opened

Hi. I have loaded the gemma-2b model on two different machines. One uses "GemmaSdpaAttention" and the other uses "GemmaAttention". The results of the two differ even though I used the same checkpoint. Has anyone had a similar problem and found the reason? Thanks!

canqin001 changed discussion status to closed

@canqin001 out of curiosity, have you found the root cause?

I have not found the root cause, but I found a workaround. Please refer to https://github.com/huggingface/transformers/blob/c409cd81777fb27aadc043ed3d8339dbc020fb3b/src/transformers/models/gemma/modeling_gemma.py#L558
We can use "_attn_implementation" to decide which attention class gets instantiated, so both machines run the same code path.
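As a minimal sketch: rather than touching the private `config._attn_implementation` attribute, recent Transformers versions accept an `attn_implementation` keyword in `from_pretrained`, which pins the attention class on every machine (the accepted values below are from the Transformers docs; access to `google/gemma-2b` is gated, so this assumes you are authenticated):

```python
from transformers import AutoModelForCausalLM

# Force the eager path (GemmaAttention) instead of letting Transformers
# auto-select SDPA (GemmaSdpaAttention) based on the local torch install.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    attn_implementation="eager",  # or "sdpa" / "flash_attention_2"
)

# Verify which attention class was actually instantiated:
print(type(model.model.layers[0].self_attn).__name__)
```

Pinning the same implementation on both machines should remove the discrepancy; the remaining small numerical differences between "eager" and "sdpa" are expected, since SDPA uses fused kernels.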