Hi @sirluk, thanks for the great post. Do you know whether the masking technique above works with some attention implementations but is incompatible with others?
For example, would the above masking work with SDPA, flash_attention_2, and eager (each of these implementations is handled a bit differently in https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L666, for example)?
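In case it helps clarify what I mean, here is roughly how I would try comparing the backends myself, just a minimal sketch where the checkpoint and inputs are placeholders (not taken from your post), and flash_attention_2 additionally needs a CUDA GPU, fp16/bf16 weights, and the flash-attn package installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Sequence one</s>Sequence two", return_tensors="pt")

for impl in ("eager", "sdpa", "flash_attention_2"):
    # Load the same checkpoint under each attention backend
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        attn_implementation=impl,     # "eager", "sdpa", or "flash_attention_2"
        torch_dtype=torch.bfloat16,   # flash_attention_2 requires fp16/bf16
        device_map="auto",
    )
    with torch.no_grad():
        logits = model(**inputs.to(model.device)).logits
    # Spot-check the last-token logits to see whether the backends agree
    print(impl, logits[0, -1, :5])
```

Would you expect the custom mask to behave the same way across all three, or is there something backend-specific one has to watch out for?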