ValueError: QWenLMHeadModel does not support Flash Attention 2.0 yet.
#1 opened by sanjeev-bhandari01
Will there be a flash_attention implementation in this model?
Qwen (1.0) uses custom code and enables flash-attention (v2) automatically, so you don't need to pass the argument when loading the model. Related info will be printed to the logs. See the following for more info: https://github.com/QwenLM/Qwen
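For reference, a minimal loading sketch. The thread does not name a checkpoint, so Qwen/Qwen-7B-Chat is assumed here; the chat call mirrors the one in the traceback below. No attention-related argument is passed, since the custom code decides on its own whether to use flash-attention:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat"  # assumed checkpoint, not named in the thread

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,  # required: Qwen 1.0 ships its own modeling code
).eval()

response, history = model.chat(tokenizer, "How are you", history=None)
print(response)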
Environment: Google Colab
GPU Info:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 68C P0 30W / 70W | 12151MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Error traceback:
RuntimeError Traceback (most recent call last)
<ipython-input-3-627d21222f84> in <cell line: 1>()
----> 1 response, history = model.chat(tokenizer, "How are you", history=None)
2 print(response)
26 frames
/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py in _flash_attn_forward(q, k, v, dropout_p, softmax_scale, causal, window_size, alibi_slopes, return_softmax)
49 maybe_contiguous = lambda x: x.contiguous() if x.stride(-1) != 1 else x
50 q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
---> 51 out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
52 q,
53 k,
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Now I am facing this problem. Thank you.
Updated explanation:
The Google Colab free-tier GPU (a Tesla T4, which is Turing rather than Ampere) does not support FlashAttention, hence the RuntimeError above.
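One possible workaround on a T4 is to keep flash-attn uninstalled (pip uninstall flash-attn), so the auto-detection cannot enable it. Another is to switch it off in the config. The sketch below assumes the custom QWenConfig from the QwenLM/Qwen repo exposes a use_flash_attn flag, and again assumes the Qwen/Qwen-7B-Chat checkpoint:

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat"  # assumed checkpoint, not named in the thread

# Load the custom config and turn the (assumed) flash-attention switch off,
# so the model falls back to the standard attention path on pre-Ampere GPUs.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.use_flash_attn = False  # assumption: flag provided by the Qwen custom code

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()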