speakleash/Bielik-11B-v2.2-Instruct-W8A8

@@ -16,12 +16,15 @@ pipeline_tag: text-generation
 # Bielik-11B-v2.2-Instruct-W8A8
-This model was obtained by quantizing the weights and activations of [Bielik-11B-v2.2-Instruct](https://huggingface.co/speakleash/Bielik-11B-v2.2-Instruct) to W8A8 (int8) data type, ready for inference with vLLM >= 0.5.0.
-AutoFP8 is used for quantization. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
-Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
-FP8 computation is supported on Nvidia GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
 ## Use with vLLM
@@ -31,7 +34,7 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
 from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
-model_id = "speakleash/Bielik-11B-v2.2-Instruct-FP8"
 sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=4096)
@@ -55,33 +58,6 @@ print(generated_text)
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
-## Use with SGLang Runtime
-Launch a server of SGLang Runtime:
-```
-python -m sglang.launch_server --model-path speakleash/Bielik-11B-v2.2-Instruct-FP8 --port 30000
-```
-Then you can send an HTTP request or use the OpenAI-compatible API.
-```python
-import openai
-client = openai.Client(
- base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
-response = client.chat.completions.create(
- model="default",
- messages=[
- {"role": "system", "content": "Jesteś pomocnym asystentem Bielik."},
- {"role": "user", "content": "Kim był Mikołaj Kopernik i z czego zasłynął?"},
- ],
- temperature=0,
- max_tokens=4096,
-)
-print(response)
-```
 ### Model description:
 * **Developed by:** [SpeakLeash](https://speakleash.org/) & [ACK Cyfronet AGH](https://www.cyfronet.pl/)

 # Bielik-11B-v2.2-Instruct-W8A8
+This model was obtained by quantizing the weights and activations of [Bielik-11B-v2.2-Instruct](https://huggingface.co/speakleash/Bielik-11B-v2.2-Instruct) to the W8A8 (INT8) data type, ready for inference with vLLM >= 0.5.0.
+This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
+Weight quantization also reduces disk size requirements by approximately 50%.
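
As a rough back-of-the-envelope illustration of the 50% figure, the sketch below estimates weight storage at 16 and 8 bits per parameter. The parameter count of 11e9 is an assumption read off the "11B" in the model name, and the estimate ignores layers that stay unquantized, the stored scaling factors, and runtime memory such as the KV cache, so actual numbers will differ.

```python
def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumption: roughly 11 billion parameters, taken from the model name.
N_PARAMS = 11e9

fp16 = weight_gib(N_PARAMS, 16)
int8 = weight_gib(N_PARAMS, 8)

print(f"16-bit weights: {fp16:.1f} GiB")
print(f" 8-bit weights: {int8:.1f} GiB")
print(f"ratio: {int8 / fp16:.2f}")
```
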
+Only weights and activations of the linear operators within transformer blocks are quantized.
+Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension.
+Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.
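
The two schemes described above can be sketched in NumPy as follows. This is a minimal illustration, not the code used to produce this model: the function names are hypothetical, and the scaling factors here come from a simple absolute-maximum rule, which is a simplification chosen for brevity.

```python
import numpy as np

INT8_MAX = 127  # symmetric range [-127, 127], no zero point


def quantize_weights_per_channel(w):
    """Static symmetric quantization with one scale per output channel.

    w has shape (out_features, in_features); scales are fixed ahead of time.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / INT8_MAX
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -INT8_MAX, INT8_MAX).astype(np.int8)
    return q, scale


def quantize_activations_per_token(x):
    """Dynamic symmetric quantization with one scale per token (row).

    x has shape (tokens, in_features); scales are computed at runtime.
    """
    scale = np.abs(x).max(axis=1, keepdims=True) / INT8_MAX
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -INT8_MAX, INT8_MAX).astype(np.int8)
    return q, scale


def int8_linear(x, w_q, w_scale):
    """Integer matmul with int32 accumulation, then rescale to floating point."""
    x_q, x_scale = quantize_activations_per_token(x)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T  # (tokens, out_features)
    return acc * x_scale * w_scale.T


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
x = rng.normal(size=(4, 16)).astype(np.float32)

w_q, w_scale = quantize_weights_per_channel(w)
y_ref = x @ w.T
y_q = int8_linear(x, w_q, w_scale)

rel_err = np.abs(y_q - y_ref).max() / np.abs(y_ref).max()
print(f"max error relative to largest output: {rel_err:.4f}")
```

Because the weight scale depends only on the output channel and the activation scale only on the token, both factor out of the integer dot product, which is why the rescaling can be applied once after the matmul.
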
+Linear scaling factors are computed by minimizing the mean squared error (MSE). The SmoothQuant algorithm is used to alleviate outliers in the activations, whereas the GPTQ algorithm is applied for quantization.
+Both algorithms are implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
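
To illustrate the outlier-smoothing idea behind SmoothQuant, the sketch below rescales each input channel so that part of the activation range is moved into the weights while the layer output stays mathematically unchanged. It is a toy NumPy example, not the llm-compressor implementation: the function name, the random data, the injected outlier channel, and the migration strength `alpha=0.5` are all assumptions for illustration, and the GPTQ step is not shown.

```python
import numpy as np


def smoothquant_scales(x, w, alpha=0.5):
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    x has shape (tokens, in_features), w has shape (out_features, in_features).
    In practice the activation statistics would come from calibration data.
    """
    act_max = np.abs(x).max(axis=0)
    w_max = np.abs(w).max(axis=0)
    return act_max**alpha / w_max ** (1 - alpha)


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))
x[:, 3] *= 50.0  # hypothetical outlier channel in the activations
w = rng.normal(size=(8, 16))

s = smoothquant_scales(x, w)
x_smooth = x / s  # activations become easier to quantize
w_smooth = w * s  # weights absorb the scaling

# The product is unchanged, so smoothing by itself costs no accuracy.
assert np.allclose(x_smooth @ w_smooth.T, x @ w.T)

print("outlier channel max before:", np.abs(x[:, 3]).max())
print("outlier channel max after: ", np.abs(x_smooth[:, 3]).max())
```
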
 ## Use with vLLM
 from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
+model_id = "speakleash/Bielik-11B-v2.2-Instruct-W8A8"
 sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=4096)
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
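
As a hedged sketch of what a client for OpenAI-compatible serving could look like, the example below builds a chat completion request using only the Python standard library. It assumes a vLLM server was started separately (for example with `python -m vllm.entrypoints.openai.api_server --model speakleash/Bielik-11B-v2.2-Instruct-W8A8`) and is listening on `http://localhost:8000`; the host, port, helper names, and `max_tokens` value are assumptions, not settings taken from this model card.

```python
import json
import urllib.request

MODEL_ID = "speakleash/Bielik-11B-v2.2-Instruct-W8A8"
BASE_URL = "http://localhost:8000/v1"  # assumption: local server on port 8000


def build_chat_request(base_url, model, messages, temperature=0.2, max_tokens=256):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


request = build_chat_request(
    BASE_URL,
    MODEL_ID,
    messages=[{"role": "user", "content": "Kim był Mikołaj Kopernik i z czego zasłynął?"}],
)
print(request.full_url)
```

With a server running, `send(request)["choices"][0]["message"]["content"]` would return the generated reply.
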
 ### Model description:
 * **Developed by:** [SpeakLeash](https://speakleash.org/) & [ACK Cyfronet AGH](https://www.cyfronet.pl/)