Remek committed on
Commit
6592e4c
1 Parent(s): 1b35365

Update README.md

Files changed (1)
  1. README.md +9 -33
README.md CHANGED
@@ -16,12 +16,15 @@ pipeline_tag: text-generation
16
 
17
  # Bielik-11B-v2.2-Instruct-W8A8
18
 
19
- This model was obtained by quantizing the weights and activations of [Bielik-11B-v.2.2-Instruct](https://huggingface.co/speakleash/Bielik-11B-v2.2-Instruct) to W8A8 (int8) data type, ready for inference with vLLM >= 0.5.0.
20
- AutoFP8 is used for quantization. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
21
- Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
22
-
23
- FP8 compuation is supported on Nvidia GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
24
 
25
 
26
  ## Use with vLLM
27
 
@@ -31,7 +34,7 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
31
  from vllm import LLM, SamplingParams
32
  from transformers import AutoTokenizer
33
 
34
- model_id = "speakleash/Bielik-11B-v2.2-Instruct-FP8"
35
 
36
  sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=4096)
37
 
@@ -55,33 +58,6 @@ print(generated_text)
55
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
56
 
57
 
58
- ## Use with SGLang Runtime
59
- Launch a server of SGLang Runtime:
60
-
61
- ```
62
- python -m sglang.launch_server --model-path speakleash/Bielik-11B-v2.2-Instruct-FP8 --port 30000
63
- ```
64
-
65
- Then you can send http request or use OpenAI Compatible API.
66
-
67
- ```python
68
- import openai
69
- client = openai.Client(
70
- base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
71
-
72
- response = client.chat.completions.create(
73
- model="default",
74
- messages=[
75
- {"role": "system", "content": "Jesteś pomocnym asystentem Bielik."},
76
- {"role": "user", "content": "Kim był Mikołaj Kopernik i z czego zasłynął?"},
77
- ],
78
- temperature=0,
79
- max_tokens=4096,
80
- )
81
- print(response)
82
-
83
- ```
84
-
85
  ### Model description:
86
 
87
  * **Developed by:** [SpeakLeash](https://speakleash.org/) & [ACK Cyfronet AGH](https://www.cyfronet.pl/)
 
16
 
17
  # Bielik-11B-v2.2-Instruct-W8A8
18
 
19
+ This model was obtained by quantizing the weights and activations of [Bielik-11B-v2.2-Instruct](https://huggingface.co/speakleash/Bielik-11B-v2.2-Instruct) to the W8A8 (INT8) data type, ready for inference with vLLM >= 0.5.0.
20
+ This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
21
+ Weight quantization also reduces disk size requirements by approximately 50%.
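
The "approximately 50%" figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below is illustrative only: `NUM_PARAMS` is an assumed nominal value read off the "11B" in the model name, and `weight_memory_gib` is a hypothetical helper, not part of any library. Because only the linear operators inside the transformer blocks are quantized, the real saving is expected to be slightly below the idealized 50% computed here.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: nominal parameter count taken from the "11B" in the model name.
NUM_PARAMS = 11e9

mem_16bit = weight_memory_gib(NUM_PARAMS, 16)
mem_int8 = weight_memory_gib(NUM_PARAMS, 8)

print(f"16-bit weights: {mem_16bit:.1f} GiB")
print(f"INT8 weights:   {mem_int8:.1f} GiB")
print(f"Saving:         {1 - mem_int8 / mem_16bit:.0%}")
```

Under these assumptions the weights shrink from about 20.5 GiB to about 10.2 GiB. KV cache and runtime buffers are not included in this estimate.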
22
 
23
+ Only the weights and activations of the linear operators within transformer blocks are quantized.
24
+ Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension.
25
+ Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.
26
+ Linear scaling factors are computed by minimizing the mean squared error (MSE). The SmoothQuant algorithm is used to alleviate outliers in the activations, whereas the GPTQ algorithm is applied for quantization.
27
+ Both algorithms are implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
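
To make the two schemes described above concrete, here is a minimal NumPy sketch of symmetric per-channel weight quantization and symmetric dynamic per-token activation quantization. It is a simplified illustration, not the llm-compressor implementation: the scales come from the maximum absolute value with round-to-nearest, rather than the MSE-minimizing scales, SmoothQuant, and GPTQ named in the README. All function names are hypothetical.

```python
import numpy as np


def quantize_weights_per_channel(w):
    """Symmetric static INT8 quantization of an (out_features, in_features) matrix.

    One scale per output channel (row), fixed ahead of time.
    Assumption: max-abs scaling stands in for the MSE-based scales.
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def quantize_activations_per_token(x):
    """Symmetric dynamic INT8 quantization of a (tokens, in_features) matrix.

    One scale per token (row), recomputed at runtime for every input.
    """
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def int8_linear(x, w):
    """Compute x @ w.T with INT8 operands and INT32 accumulation."""
    qx, sx = quantize_activations_per_token(x)   # sx: (tokens, 1)
    qw, sw = quantize_weights_per_channel(w)     # sw: (out_features, 1)
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T
    # Undo both scalings: row i uses the token scale, column j the channel scale.
    return acc.astype(np.float32) * sx * sw.T


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
w = rng.standard_normal((8, 64))

reference = x @ w.T
approx = int8_linear(x, w)
rel_err = np.linalg.norm(approx - reference) / np.linalg.norm(reference)
print(f"relative error: {rel_err:.4f}")
```

Because both scale vectors factor out of the matrix product, the inner multiply-accumulate runs entirely on integers, which is where the throughput gain on INT8-capable hardware comes from.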
28
 
29
  ## Use with vLLM
30
 
 
34
  from vllm import LLM, SamplingParams
35
  from transformers import AutoTokenizer
36
 
37
+ model_id = "speakleash/Bielik-11B-v2.2-Instruct-W8A8"
38
 
39
  sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=4096)
40
 
 
58
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
59
 
60
 
61
  ### Model description:
62
 
63
  * **Developed by:** [SpeakLeash](https://speakleash.org/) & [ACK Cyfronet AGH](https://www.cyfronet.pl/)