Bidirectional or Causal?

#11
by AlignLearner - opened
from transformers import MistralConfig
from transformers.models.mistral.modeling_mistral import MistralModel

class BidirectionalMistralModel(MistralModel):
    config_class = BidirectionalMistralConfig

    def __init__(self, config: MistralConfig):
        super().__init__(config)
        # Intended to disable causal masking in every attention layer.
        for layer in self.layers:
            layer.self_attn.is_causal = False
        # Force the eager (pure-PyTorch) attention implementation.
        self._attn_implementation = "eager"

However, MistralAttention (the eager implementation) doesn't use is_causal, so with _attn_implementation = "eager" the flag appears to have no effect:

MistralFlashAttention2 uses layer.self_attn.is_causal.
MistralSdpaAttention doesn't use layer.self_attn.is_causal.
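For reference, a minimal sketch of where the causal constraint actually comes from in eager mode, assuming transformers ~4.36 (where these helpers live in transformers.modeling_attn_mask_utils): the causal triangle is baked into the 4D mask built in MistralModel.forward, not read from self_attn.is_causal, so a bidirectional variant would have to swap the mask construction rather than flip the flag.

import torch
from transformers.modeling_attn_mask_utils import (
    _prepare_4d_attention_mask,          # padding-only mask (bidirectional)
    _prepare_4d_causal_attention_mask,   # padding mask + causal triangle
)

# 2D padding mask for one sequence of 4 tokens, the last one padding.
mask_2d = torch.tensor([[1, 1, 1, 0]])
dummy_embeds = torch.empty(1, 4, 8)  # only dtype/device are read from this

# What MistralModel.forward builds for eager attention: causal + padding.
causal_4d = _prepare_4d_causal_attention_mask(mask_2d, (1, 4), dummy_embeds, 0)

# What a bidirectional variant would have to build instead: padding only.
bidi_4d = _prepare_4d_attention_mask(mask_2d, dummy_embeds.dtype)

print(causal_4d[0, 0])  # upper triangle masked with large negative values
print(bidi_4d[0, 0])    # only the padding column is masked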

nada5 changed discussion status to closed
NVIDIA org

Hi @AlignLearner. In fact, NV-Embed adopts the eager mode, which does not use SDPA attention.
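For anyone loading the model, a hedged sketch of forcing the eager path, assuming a transformers version >= 4.36 that accepts attn_implementation and that the repo's remote code passes it through:

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "nvidia/NV-Embed-v1",         # checkpoint name assumed for illustration
    trust_remote_code=True,       # NV-Embed ships custom modeling code
    attn_implementation="eager",  # select eager attention, not SDPA/FlashAttention2
)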
