It also works on Windows, using the latest update of the CUDA GPTQ tree.
by BGLuck
Hi there, just wanted to mention that the act-order models also work on Windows, using the CUDA GPTQ tree (https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).
Tested on 2x RTX 4090 and it works without issues (albeit slowly).
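In case it helps anyone else, this is roughly how I set it up (a minimal sketch; the paths and the `repositories\` layout are just how my oobabooga install is arranged, so adjust to yours):

```
:: From the text-generation-webui root, with the venv activated
git clone -b cuda https://github.com/qwopqwop200/GPTQ-for-LLaMa.git repositories\GPTQ-for-LLaMa
cd repositories\GPTQ-for-LLaMa
:: Build the CUDA kernel extension (needs the CUDA toolkit and MSVC build tools on Windows)
python setup_cuda.py install
```

After that the model loads through the webui as normal; the log below is from one of my runs.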
```
F:\ChatIAs\oobabooga\bats\bats\nopassword>call F:\ChatIAs\oobabooga\venv\Scripts\activate.bat
Gradio HTTP request redirected to localhost :)
Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
Loading the extension "gallery"... Ok.
Starting API at http://0.0.0.0:5000/api
Running on local URL: http://0.0.0.0:7990
To create a public link, set `share=True` in `launch()`.
Loading alpaca-lora-65B-GPTQ-4bit_128g...
Found the following quantized model: models\alpaca-lora-65B-GPTQ-4bit_128g\alpaca-lora-65B-GPTQ-4bit-128g.safetensors
Loading model ...
Done.
Using the following device map for the quantized model: {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0, 'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0, 'model.layers.35': 0, 'model.layers.36': 0, 'model.layers.37': 1, 'model.layers.38': 1, 'model.layers.39': 1, 'model.layers.40': 1, 'model.layers.41': 1, 'model.layers.42': 1, 'model.layers.43': 1, 'model.layers.44': 1, 'model.layers.45': 1, 'model.layers.46': 1, 'model.layers.47': 1, 'model.layers.48': 1, 'model.layers.49': 1, 'model.layers.50': 1, 'model.layers.51': 1, 'model.layers.52': 1, 'model.layers.53': 1, 'model.layers.54': 1, 'model.layers.55': 1, 'model.layers.56': 1, 'model.layers.57': 1, 'model.layers.58': 1, 'model.layers.59': 1, 'model.layers.60': 1, 'model.layers.61': 1, 'model.layers.62': 1, 'model.layers.63': 1, 'model.layers.64': 1, 'model.layers.65': 1, 'model.layers.66': 1, 'model.layers.67': 1, 'model.layers.68': 1, 'model.layers.69': 1, 'model.layers.70': 1, 'model.layers.71': 1, 'model.layers.72': 1, 'model.layers.73': 1, 'model.layers.74': 1, 'model.layers.75': 1, 'model.layers.76': 1, 'model.layers.77': 1, 'model.layers.78': 1, 'model.layers.79': 1, 'model.norm': 1, 'lm_head': 1}
Replaced attention with sdp_attention
Loaded the model in 117.40 seconds.
Output generated in 40.38 seconds (2.20 tokens/s, 89 tokens, context 51, seed 968699792)
Output generated in 45.34 seconds (1.06 tokens/s, 48 tokens, context 853, seed 871994252)
Output generated in 25.02 seconds (1.32 tokens/s, 33 tokens, context 853, seed 2031923582)
```
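For reference, a launch command along these lines should reproduce the setup above (the 22 GiB per-GPU split is just a starting guess for 4090s, not a measured value; `--wbits`, `--groupsize`, `--model_type`, `--gpu-memory`, `--sdp-attention`, `--api` and `--listen` are standard text-generation-webui flags):

```
:: Split the 65B model across both GPUs; tune --gpu-memory if one card runs out of VRAM
python server.py --model alpaca-lora-65B-GPTQ-4bit_128g --wbits 4 --groupsize 128 ^
    --model_type llama --gpu-memory 22 22 --sdp-attention --api --listen --listen-port 7990
```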
OK, thank you!
In my earlier READMEs I used to tell people that they could use the latest CUDA branch as well, and that this would work on Windows. But then I got so many comments from people who couldn't get it working or only got gibberish that I stopped listing it as working.
But now that I look again, I see that the CUDA branch has had several new commits since then, so I guess this has been fixed.
Thanks for reporting it. I will update the README to indicate that CUDA is viable as well.