mgoin committed (verified)
Commit 025c0c1 · Parent: 1aa2e4c

Update README.md

Files changed (1)
  1. README.md +1 -2
README.md CHANGED
@@ -40,12 +40,11 @@ Only weights and activations of the linear operators within transformers blocks
  Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between FP8 and BF16 representations for each output channel dimension.
  Linear scaling factors are computed by minimizing the mean squared error (MSE).
  Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between FP8 and BF16 representations.
- The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
 
 
  ## Deployment with vLLM
 
- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
 
  ## Evaluation
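
The README lines in this diff describe the FP8 scheme only in prose, so here is a minimal, hypothetical sketch of what they amount to: static symmetric per-output-channel scales for weights chosen by a small MSE search, and dynamic symmetric per-token scales for activations computed at runtime. The function names, the simple shrink-and-retry search, and the use of PyTorch's `torch.float8_e4m3fn` are illustrative assumptions, not the llm-compressor implementation.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def quantize_weight_per_channel(weight: torch.Tensor, num_candidates: int = 100):
    """Static symmetric per-output-channel scales, picked by a small MSE search."""
    # weight: [out_channels, in_channels] in BF16/FP32
    absmax = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)  # [out, 1]
    best_scale = absmax / FP8_E4M3_MAX
    best_err = torch.full_like(absmax, float("inf"))
    for step in range(num_candidates):
        ratio = 1.0 - step / num_candidates              # shrink the clip range gradually
        scale = (absmax * ratio) / FP8_E4M3_MAX
        q = torch.clamp(weight / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
        dequant = q.to(torch.float8_e4m3fn).to(weight.dtype) * scale
        err = (weight - dequant).pow(2).mean(dim=1, keepdim=True)
        better = err < best_err                          # keep the scale with lowest MSE per channel
        best_err = torch.where(better, err, best_err)
        best_scale = torch.where(better, scale, best_scale)
    q = torch.clamp(weight / best_scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.to(torch.float8_e4m3fn), best_scale


def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic symmetric per-token scales, computed at runtime from each token's absmax."""
    # x: [num_tokens, hidden_size]
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    q = torch.clamp(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.to(torch.float8_e4m3fn), scale


# Example: quantize a random weight matrix and a batch of activations.
w_q, w_scale = quantize_weight_per_channel(torch.randn(256, 512))
x_q, x_scale = quantize_activation_per_token(torch.randn(8, 512))
```

The weight scales are fixed once (static), while each incoming token gets its own scale (dynamic), matching the per-channel/per-token split described above.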
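
As a usage illustration for the deployment sentence added in this commit, here is a minimal offline-inference sketch with vLLM's Python API; `<model-id>` is a placeholder for this repository's id, not a value taken from the diff.

```python
from vllm import LLM, SamplingParams

# "<model-id>" is a placeholder -- replace it with this checkpoint's repository id.
llm = LLM(model="<model-id>")

# Deterministic sampling for a quick smoke test of the FP8 checkpoint.
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["Explain FP8 weight quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

For OpenAI-compatible serving, `vllm serve <model-id>` starts an HTTP server exposing the standard /v1/completions and /v1/chat/completions endpoints.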