feihu.hf committed
Commit d2a58d1 (1 parent: d30e8dd)

update README.md

README.md CHANGED
@@ -130,6 +130,12 @@ Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).
 
 **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
 
+## Benchmark and Speed
+
+To compare the generation performance between bfloat16 (bf16) and quantized models such as GPTQ-Int8, GPTQ-Int4, and AWQ, please consult our [Benchmark of Quantized Models](https://qwen.readthedocs.io/en/latest/benchmark/quantization_benchmark.html). This benchmark provides insights into how different quantization techniques affect model performance.
+
+For those interested in understanding the inference speed and memory consumption when deploying these models with either ``transformer`` or ``vLLM``, we have compiled an extensive [Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
+
 ## Citation
 
 If you find our work helpful, feel free to give us a cite.
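
Reviewer note on the unchanged **Note** line: the advice to add `rope_scaling` only when long contexts are needed could be applied roughly as in the sketch below. This is only an illustration under stated assumptions. The helper `with_rope_scaling`, the native context length of 32768, the factor of 4.0, and the key names inside the `rope_scaling` entry are not taken from this commit and should be checked against the model's own README and `config.json`.

```python
import copy

# Assumed values for illustration only; they do not come from this commit.
NATIVE_CONTEXT = 32768  # assumed native context window of the model
YARN_FACTOR = 4.0       # assumed static YaRN scaling factor


def with_rope_scaling(config: dict, max_input_tokens: int) -> dict:
    """Return a copy of `config` with static YaRN enabled only for long inputs.

    Hypothetical helper: because the scaling factor is static, it is left out
    whenever the expected inputs fit in the native context window.
    """
    updated = copy.deepcopy(config)
    if max_input_tokens > NATIVE_CONTEXT:
        # Key names are an assumption about the config layout.
        updated["rope_scaling"] = {
            "type": "yarn",
            "factor": YARN_FACTOR,
            "original_max_position_embeddings": NATIVE_CONTEXT,
        }
    else:
        # Short inputs: drop any existing entry so short-text quality is unaffected.
        updated.pop("rope_scaling", None)
    return updated
```

The function returns a new dictionary instead of mutating its argument, so the same base config can be reused for both short-context and long-context deployments.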