feihu.hf committed
Commit d2a58d1 (1 parent: d30e8dd)

update README.md

Files changed (1): README.md (+6, -0)
README.md CHANGED
@@ -130,6 +130,12 @@ Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).
 
 **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
 
+ ## Benchmark and Speed
+
+ To compare the generation performance between bfloat16 (bf16) and quantized models such as GPTQ-Int8, GPTQ-Int4, and AWQ, please consult our [Benchmark of Quantized Models](https://qwen.readthedocs.io/en/latest/benchmark/quantization_benchmark.html). This benchmark provides insights into how different quantization techniques affect model performance.
+
+ For those interested in understanding the inference speed and memory consumption when deploying these models with either ``transformer`` or ``vLLM``, we have compiled an extensive [Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
+
 ## Citation
 
 If you find our work helpful, feel free to give us a cite.
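
For context on the `rope_scaling` note in the diff above: on Qwen2-style checkpoints, static YARN is typically enabled by adding a `rope_scaling` entry to the model's `config.json` before serving it. The snippet below is a minimal sketch of that edit, not the repository's own instructions; the checkpoint path and the `factor` / `original_max_position_embeddings` values are placeholders and should be set to match this model's native and target context lengths.

```python
import json
from pathlib import Path

# Placeholder path to a local copy of the checkpoint (assumption, adjust as needed).
config_path = Path("path/to/local/checkpoint/config.json")

config = json.loads(config_path.read_text())

# Enable static YARN rope scaling. The numeric values are illustrative
# placeholders; choose the factor and original_max_position_embeddings that
# correspond to this model's actual context window.
config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

config_path.write_text(json.dumps(config, indent=2))
```

As the note in the diff says, this change should only be applied when long-context processing is actually required, since static YARN can reduce quality on shorter inputs.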