vLLM cannot infer this model (other 70B GPTQ models are OK)

#1
by tutu329 - opened

ExLlama can infer this model, but ExLlama is not very stable.
vLLM is the ideal option. Why can't vLLM run it?

I am using this model in TGI without any issue. I used the latest AutoGPTQ to quantize this model. https://github.com/huggingface/text-generation-inference

@MaziyarPanahi Thanks for quantizing the model and sharing it. How much VRAM do I need to load this version for inference?

Using TGI also runs out of VRAM.
docker run --gpus all --shm-size 1g -p 8001:8001 -v /home/tutu/models/miqu-1-70b-sf-GPTQ:/model ghcr.io/huggingface/text-generation-inference:1.4 --model-id /model --quantize gptq --hostname 0.0.0.0 --port 8001
Using TGI for other GPTQ models is OK.
So strange.

Is tokenizer_config.json correct? For example, "model_max_length"?

This model's max length is 8k (8192). If you are short on VRAM, I would reduce the max length to 4k and also set cuda_fraction (cuda_memory_fraction in the TGI config) to 0.95 so you can use all the available GPU memory. (This model needs more memory than other 70B GPTQ models because it has double the context length.)
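
For a rough sense of how much the context length matters, here is a back-of-the-envelope sketch of the KV-cache size per sequence. The architecture numbers (80 layers, 8 KV heads, head dimension 128, fp16 cache) are assumptions based on the Llama-2-70B layout and are not confirmed anywhere in this thread, so check the model's config.json before relying on them.

```python
# Rough KV-cache estimate. All architecture defaults below are ASSUMPTIONS
# (Llama-2-70B-style layout), not values taken from this thread.
def kv_cache_bytes(seq_len, batch_size=1, num_layers=80, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    # Factor 2 covers the separate key and value tensors in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

GIB = 1024 ** 3
for seq_len in (4096, 8192):
    size_gib = kv_cache_bytes(seq_len) / GIB
    print(f"{seq_len} tokens: {size_gib:.2f} GiB per sequence")
```

Under these assumptions, halving the max length from 8k to 4k halves the cache needed per sequence; the total also scales linearly with the number of concurrent sequences.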

I am seeing the same on vLLM. I wonder if this is the watermarking?

So this is my TGI, and it's pretty fast!

{ model_id: "MaziyarPanahi/miqu-1-70b-sf-GPTQ", revision: None, validation_workers: 2, sharded: Some(true), num_shard: Some(4), quantize: Some(Gptq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 7100, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 8192, max_batch_total_tokens: Some(1044000), max_waiting_tokens: 20, hostname: "b869416c7485", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.9, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, env: false }
2024-02-08T18:44:08.167914Z  INFO text_generation_router: router/src/main.rs:420: Serving revision 010afdb6478a25946fc381a327c82b83a86e99b0 of model MaziyarPanahi/miqu-1-70b-sf-GPTQ
2024-02-08T18:44:08.167938Z  INFO text_generation_router: router/src/main.rs:237: Using the Hugging Face API to retrieve tokenizer config
2024-02-08T18:44:08.174077Z  INFO text_generation_router: router/src/main.rs:280: Warming up model
2024-02-08T18:44:18.394725Z  WARN text_generation_router: router/src/main.rs:301: `--max-batch-total-tokens` is deprecated for Flash Attention models.
2024-02-08T18:44:18.394748Z  WARN text_generation_router: router/src/main.rs:305: Inferred max batch total tokens: 419728
2024-02-08T18:44:18.394752Z  INFO text_generation_router: router/src/main.rs:316: Setting max batch total tokens to 419728
2024-02-08T18:44:18.394754Z  INFO text_generation_router: router/src/main.rs:317: Connected
2024-02-08T18:44:18.394758Z  WARN text_generation_router: router/src/main.rs:322: Invalid hostname, defaulting to 0.0.0.0
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe           On | 00000000:08:00.0 Off |                    0 |
| N/A   35C    P0               65W / 300W|  47394MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe           On | 00000000:48:00.0 Off |                    0 |
| N/A   33C    P0               62W / 300W|  47402MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe           On | 00000000:88:00.0 Off |                    0 |
| N/A   33C    P0               63W / 300W|  47402MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe           On | 00000000:C8:00.0 Off |                    0 |
| N/A   33C    P0               62W / 300W|  51522MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    313848      C   /opt/conda/bin/python3.10                 47392MiB |
|    1   N/A  N/A    313849      C   /opt/conda/bin/python3.10                 47400MiB |
|    2   N/A  N/A    313850      C   /opt/conda/bin/python3.10                 47400MiB |
|    3   N/A  N/A    313853      C   /opt/conda/bin/python3.10                 51520MiB |
+---------------------------------------------------------------------------------------+

Test:

What's Large Language Model? answer in 3 bullet points


Response:

1. A large language model is a type of artificial intelligence model that has been trained on a vast amount of text data to generate human-like text.
2. These models use machine learning algorithms to analyze patterns in the data and learn how to produce coherent and contextually relevant responses to a wide range of prompts.
3. Large language models can be used for a variety of natural language processing tasks, such as text generation, translation, summarization, and question answering, and are often used in virtual assistants, chatbots, and other conversational AI applications.

| Parameter          | Value                            |
|--------------------|----------------------------------|
| Model              | MaziyarPanahi/miqu-1-70b-sf-GPTQ |
| Sequence Length    | 10                               |
| Decode Length      | 8                                |
| Top N Tokens       | None                             |
| N Runs             | 10                               |
| Warmups            | 10                               |
| Temperature        | None                             |
| Top K              | None                             |
| Top P              | None                             |
| Typical P          | None                             |
| Repetition Penalty | None                             |
| Watermark          | false                            |
| Do Sample          | false                            |


| Step           | Batch Size | Average   | Lowest    | Highest   | p50       | p90       | p99       |
|----------------|------------|-----------|-----------|-----------|-----------|-----------|-----------|
| Prefill        | 1          | 45.41 ms  | 45.25 ms  | 45.93 ms  | 45.35 ms  | 45.93 ms  | 45.93 ms  |
|                | 2          | 59.71 ms  | 58.65 ms  | 66.89 ms  | 58.96 ms  | 66.89 ms  | 66.89 ms  |
|                | 4          | 84.12 ms  | 83.24 ms  | 85.01 ms  | 84.30 ms  | 85.01 ms  | 85.01 ms  |
|                | 8          | 104.74 ms | 102.54 ms | 114.02 ms | 102.91 ms | 114.02 ms | 114.02 ms |
|                | 16         | 139.02 ms | 136.19 ms | 147.54 ms | 137.76 ms | 147.54 ms | 147.54 ms |
|                | 32         | 207.08 ms | 203.38 ms | 213.40 ms | 205.43 ms | 213.40 ms | 213.40 ms |
|                | 64         | 342.57 ms | 342.08 ms | 343.25 ms | 342.64 ms | 343.25 ms | 343.25 ms |
|                | 128        | 629.88 ms | 629.04 ms | 630.63 ms | 630.18 ms | 630.63 ms | 630.63 ms |
| Decode (token) | 1          | 37.28 ms  | 35.48 ms  | 39.80 ms  | 37.70 ms  | 35.82 ms  | 35.82 ms  |
|                | 2          | 38.19 ms  | 36.31 ms  | 40.41 ms  | 38.23 ms  | 38.17 ms  | 38.17 ms  |
|                | 4          | 37.38 ms  | 36.12 ms  | 38.88 ms  | 37.67 ms  | 36.38 ms  | 36.38 ms  |
|                | 8          | 38.35 ms  | 36.94 ms  | 41.21 ms  | 38.19 ms  | 39.34 ms  | 39.34 ms  |
|                | 16         | 48.95 ms  | 47.28 ms  | 51.23 ms  | 49.03 ms  | 49.63 ms  | 49.63 ms  |
|                | 32         | 73.37 ms  | 72.74 ms  | 74.33 ms  | 73.37 ms  | 72.94 ms  | 72.94 ms  |
|                | 64         | 102.43 ms | 102.29 ms | 102.62 ms | 102.45 ms | 102.30 ms | 102.30 ms |
|                | 128        | 131.91 ms | 131.74 ms | 131.99 ms | 131.93 ms | 131.99 ms | 131.99 ms |
| Decode (total) | 1          | 260.95 ms | 248.35 ms | 278.60 ms | 263.92 ms | 250.75 ms | 250.75 ms |
|                | 2          | 267.34 ms | 254.16 ms | 282.88 ms | 267.59 ms | 267.23 ms | 267.23 ms |
|                | 4          | 261.67 ms | 252.85 ms | 272.19 ms | 263.67 ms | 254.64 ms | 254.64 ms |
|                | 8          | 268.42 ms | 258.59 ms | 288.47 ms | 267.33 ms | 275.39 ms | 275.39 ms |
|                | 16         | 342.65 ms | 330.96 ms | 358.62 ms | 343.24 ms | 347.44 ms | 347.44 ms |
|                | 32         | 513.62 ms | 509.20 ms | 520.30 ms | 513.58 ms | 510.60 ms | 510.60 ms |
|                | 64         | 717.02 ms | 716.04 ms | 718.36 ms | 717.18 ms | 716.07 ms | 716.07 ms |
|                | 128        | 923.36 ms | 922.17 ms | 923.91 ms | 923.54 ms | 923.91 ms | 923.91 ms |


| Step    | Batch Size | Average            | Lowest             | Highest            |
|---------|------------|--------------------|--------------------|--------------------|
| Prefill | 1          | 22.02 tokens/secs  | 21.77 tokens/secs  | 22.10 tokens/secs  |
|         | 2          | 33.55 tokens/secs  | 29.90 tokens/secs  | 34.10 tokens/secs  |
|         | 4          | 47.55 tokens/secs  | 47.05 tokens/secs  | 48.05 tokens/secs  |
|         | 8          | 76.46 tokens/secs  | 70.16 tokens/secs  | 78.02 tokens/secs  |
|         | 16         | 115.17 tokens/secs | 108.45 tokens/secs | 117.48 tokens/secs |
|         | 32         | 154.58 tokens/secs | 149.95 tokens/secs | 157.34 tokens/secs |
|         | 64         | 186.82 tokens/secs | 186.46 tokens/secs | 187.09 tokens/secs |
|         | 128        | 203.21 tokens/secs | 202.97 tokens/secs | 203.48 tokens/secs |
| Decode  | 1          | 26.87 tokens/secs  | 25.13 tokens/secs  | 28.19 tokens/secs  |
|         | 2          | 52.43 tokens/secs  | 49.49 tokens/secs  | 55.08 tokens/secs  |
|         | 4          | 107.09 tokens/secs | 102.87 tokens/secs | 110.74 tokens/secs |
|         | 8          | 208.98 tokens/secs | 194.13 tokens/secs | 216.56 tokens/secs |
|         | 16         | 327.11 tokens/secs | 312.31 tokens/secs | 338.41 tokens/secs |
|         | 32         | 436.14 tokens/secs | 430.52 tokens/secs | 439.90 tokens/secs |
|         | 64         | 624.81 tokens/secs | 623.65 tokens/secs | 625.66 tokens/secs |
|         | 128        | 970.37 tokens/secs | 969.79 tokens/secs | 971.62 tokens/secs |
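
As a sanity check on the two tables above, the throughput figures appear to be roughly the batch size divided by the average step latency. The sketch below recomputes a few decode rows; the function name is only illustrative, and small differences from the table are expected because an average of per-run rates is not the same as the rate computed from the average latency.

```python
# Recompute decode throughput from the average per-token latencies
# reported in the latency table (values copied from the thread).
def tokens_per_second(batch_size, latency_ms):
    return batch_size / (latency_ms / 1000.0)

decode_latency_ms = {1: 37.28, 8: 38.35, 64: 102.43, 128: 131.91}
for batch_size, latency_ms in decode_latency_ms.items():
    rate = tokens_per_second(batch_size, latency_ms)
    print(f"batch {batch_size}: {rate:.2f} tokens/sec")
```

Per-token latency stays nearly flat up to batch size 8, which is why the reported throughput scales almost linearly over that range.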

So I need 4×80G of VRAM?
I have only 4×22G of VRAM...

I added --num-shard 4,
and then TGI is OK.
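
For anyone reproducing this, here is a small sketch that assembles the earlier docker command with the `--num-shard` flag added. Every argument is copied from the command posted above, apart from `--num-shard`; the helper name `tgi_docker_command` is hypothetical, and the model path is the original poster's local path.

```python
import shlex

# Build the TGI docker command from this thread, plus --num-shard.
# `tgi_docker_command` is a hypothetical helper, not part of TGI.
def tgi_docker_command(model_dir, num_shard=4, port=8001,
                       image="ghcr.io/huggingface/text-generation-inference:1.4"):
    return [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        "-p", f"{port}:{port}",
        "-v", f"{model_dir}:/model",
        image,
        "--model-id", "/model",
        "--quantize", "gptq",
        "--num-shard", str(num_shard),  # split the model across the GPUs
        "--hostname", "0.0.0.0",
        "--port", str(port),
    ]

cmd = tgi_docker_command("/home/tutu/models/miqu-1-70b-sf-GPTQ")
print(shlex.join(cmd))
```

Without sharding, TGI would try to place the whole model on one GPU, which is a plausible reason for the earlier out-of-VRAM error on 22G cards.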

> So I need 4×80G of VRAM? I have only 4×22G of VRAM...

It needs less than 200GB of VRAM to load the model. If you need larger batches and longer sequences, TGI can use the rest of the memory if you allow it via cuda_fraction.
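
To put a number on that, the per-GPU usage from the nvidia-smi output earlier in the thread can be summed directly. This is only a sketch of the arithmetic: the snapshot was taken from a running server, so it presumably includes whatever TGI allocated beyond the weights and should be read as an upper bound on what loading requires.

```python
# Per-GPU memory usage (MiB) copied from the nvidia-smi output in this thread.
used_mib = [47394, 47402, 47402, 51522]
capacity_mib = 81920          # each A100 80GB card
cuda_memory_fraction = 0.9    # value shown in the TGI config dump

total_used_gib = sum(used_mib) / 1024
budget_per_gpu_mib = capacity_mib * cuda_memory_fraction

print(f"total used: {total_used_gib:.1f} GiB across {len(used_mib)} GPUs")
print(f"per-GPU budget at fraction {cuda_memory_fraction}: {budget_per_gpu_mib:.0f} MiB")
```

Each GPU sits well under its budget in that snapshot, which is consistent with the remaining memory being available for larger batches.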

vLLM recently added support for 2-bit GPTQ quantization. Any chance it will run on 24GB of VRAM in 2 bits? AFAIK GGUF and EXL can fit, but are slow.

@ceoofcapybaras Can the 4-bit GPTQ be automatically converted to 2-bit in vLLM, or do I have to quantize it to 2-bit with GPTQ? (I've never tried it in AutoGPTQ to be honest, must be new)

Late reply, but @ceoofcapybaras if you need 2-bit, try https://huggingface.co/AlexWortega/miqu-1-70b-AQLM-2Bit-1x16-hf with --quantization aqlm in vllm. Works well in my personal evals, and easily fits on a single 3090/4090. It runs at about 8 tokens per second for a simple prompt like "write a story about X" (i.e. no prefill, batch size 1).

Seems to live up to its "SoTA 2-bit quantization" claim - at least relative to exl2, which is unusable (quality-wise) at 2 bits.
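
A quick sketch of why 2 bits fits on a 24GB card while 4 bits does not: the weight footprint scales linearly with bits per weight. The parameter count of 70e9 is an assumption taken from the "70b" in the model name, and the estimate ignores codebooks, scales, and any layers left unquantized, so real checkpoints will be somewhat larger.

```python
# Approximate weight footprint for a given bit width.
# ASSUMPTION: ~70e9 parameters; quantization metadata is ignored.
def weight_gb(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 70e9
for bits in (2, 4, 16):
    print(f"{bits}-bit: ~{weight_gb(NUM_PARAMS, bits):.1f} GB of weights")
```

The 2-bit estimate leaves a few GB of headroom on a 24GB card for the KV cache and activations, whereas the 4-bit weights alone already exceed it.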
