Update README.md
README.md CHANGED

@@ -841,7 +841,7 @@ print(pipe(prompt_template)[0]['generated_text'])
 
 This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will *not* work with ExLlama.
 
-It was created with group_size none (-1) to reduce VRAM usage, and with --act-order (desc_act) to
+It was created with group_size none (-1) to reduce VRAM usage, and with --act-order (desc_act) to improve accuracy of responses.
 
 * `gptq_model-4bit-128g.safetensors`
   * Works with AutoGPTQ in CUDA or Triton modes.
@@ -856,7 +856,7 @@ It was created with group_size none (-1) to reduce VRAM usage, and with --act-or
 
 This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will *not* work with ExLlama.
 
-It was created with both group_size 128g and --act-order (desc_act) for
+It was created with both group_size 128g and --act-order (desc_act) for even higher inference accuracy, at the cost of increased VRAM usage. Because we already need 2 x 80GB or 3 x 48GB GPUs, I don't expect the increased VRAM usage to change the GPU requirements.
 
 **Note** Using group_size + desc_act together can significantly lower performance in AutoGPTQ CUDA. You might want to try AutoGPTQ Triton mode instead (Linux only.)
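
For context on how the file described above is typically loaded, here is a minimal sketch using AutoGPTQ's `from_quantized` API. Only the filename `gptq_model-4bit-128g.safetensors` is taken from the diff; the repository id `TheBloke/example-GPTQ` and the prompt are placeholders, and the `use_triton` flag illustrates the CUDA-vs-Triton choice mentioned in the note (Triton is Linux only).

```python
# Sketch only: loads a 4-bit GPTQ checkpoint with AutoGPTQ.
# "TheBloke/example-GPTQ" is a placeholder repo id, not from the diff.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/example-GPTQ"   # placeholder
model_basename = "gptq_model-4bit-128g"        # file named in the diff

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# use_triton=False selects the CUDA kernels; set True on Linux to try
# Triton mode, which the note suggests when group_size + desc_act is slow.
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    use_triton=False,
    device="cuda:0",
)

prompt = "Tell me about AI"  # placeholder prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0]))
```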