Update README.md
README.md
CHANGED
@@ -40,12 +40,11 @@ Only weights and activations of the linear operators within transformers blocks
Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between FP8 and BF16 representations for each output channel dimension.
Linear scaling factors are computed by minimizing the mean squared error (MSE).
Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between FP8 and BF16 representations.
-The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.


## Deployment with vLLM

-This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM
+This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.


## Evaluation
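The quantization scheme described in the unchanged context lines (static symmetric per-output-channel FP8 weight scales chosen by minimizing MSE, plus dynamic per-token activation scales) can be illustrated with a minimal sketch. This is not the llm-compressor implementation; it assumes PyTorch's `float8_e4m3fn` dtype and a simple grid search over shrink factors of the per-channel absolute maximum.

```python
# Illustrative sketch only: symmetric static per-channel FP8 weight quantization
# with an MSE-minimizing scale search. Not the llm-compressor/GPTQ code path.
import torch

FP8_MAX = 448.0  # largest representable magnitude of float8_e4m3fn


def quantize_weights_per_channel_mse(weight: torch.Tensor, num_grid: int = 80):
    """Quantize a [out_channels, in_channels] weight to FP8, one scale per output channel.

    For each output channel, candidate scales (shrinking the channel's abs-max)
    are tried and the one with the lowest quantize-dequantize MSE is kept.
    """
    w = weight.float()
    abs_max = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)  # [out, 1]
    best_scale = abs_max / FP8_MAX
    best_err = torch.full_like(abs_max, float("inf"))

    for step in range(num_grid):
        shrink = 1.0 - step / num_grid                     # 1.0 down toward 0
        scale = (abs_max * shrink) / FP8_MAX               # candidate per-channel scale
        q = torch.clamp(w / scale, -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        deq = q.float() * scale                            # dequantized reconstruction
        err = (w - deq).pow(2).mean(dim=1, keepdim=True)   # per-channel MSE
        improved = err < best_err
        best_err = torch.where(improved, err, best_err)
        best_scale = torch.where(improved, scale, best_scale)

    q_final = torch.clamp(w / best_scale, -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q_final, best_scale.to(torch.bfloat16)
```

Activation quantization differs only in that the scale is recomputed at runtime per token (per row of the activation matrix) rather than fixed ahead of time.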
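For the deployment sentence added in this commit, a minimal vLLM usage sketch follows. The model ID, prompt, and sampling parameters are placeholders, not taken from the README.

```python
# Minimal offline-inference sketch with vLLM; substitute this repository's
# actual model ID for the placeholder below.
from vllm import LLM, SamplingParams

llm = LLM(model="<namespace>/<model-name-FP8>")  # placeholder model ID
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain FP8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For the OpenAI-compatible serving mentioned in the new line, recent vLLM versions expose a CLI entry point of the form `vllm serve <namespace>/<model-name-FP8>`, which starts an HTTP server speaking the OpenAI chat/completions API.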