TheBloke committed
Commit 3bec295
1 Parent(s): a0ffe42

Update README.md

Files changed (1)
  1. README.md +10 -65
README.md CHANGED
@@ -45,11 +45,12 @@ This repo contains AWQ model files for [Mistral AI's Mistral 7B v0.1](https://hu
 
 AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.
 
- It is also now supported by continuous batching server [vLLM](https://github.com/vllm-project/vllm), allowing use of Llama AWQ models for high-throughput concurrent inference in multi-user server scenarios.
-
- As of September 25th 2023, preliminary Llama-only AWQ support has also been added to [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference).
-
- Note that, at the time of writing, overall throughput is still lower than running vLLM or TGI with unquantised models, however using AWQ enables using much smaller GPUs which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.
+ ### Mistral AWQs
+
+ These are experimental first AWQs for the brand-new model format, Mistral.
+
+ They will not work from vLLM or TGI. They can only be used from AutoAWQ, and they require installing both AutoAWQ and Transformers from Github. More details are below.
+
 <!-- description end -->
 <!-- repositories-available start -->
 ## Repositories available
@@ -64,7 +65,6 @@ Note that, at the time of writing, overall throughput is still lower than runnin
 
 ```
 {prompt}
-
 ```
 
 <!-- prompt-template end -->
@@ -83,74 +83,23 @@ Models are released as sharded safetensors files.
 
 <!-- README_AWQ.md-provided-files end -->
 
- <!-- README_AWQ.md-use-from-vllm start -->
- ## Serving this model from vLLM
-
- Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
-
- - When using vLLM as a server, pass the `--quantization awq` parameter, for example:
-
- ```shell
- python3 -m vllm.entrypoints.api_server --model TheBloke/Mistral-7B-v0.1-AWQ --quantization awq --dtype half
- ```
-
- Note: at the time of writing, vLLM has not yet done a new release with support for the `quantization` parameter.
-
- If you try the code below and get an error about `quantization` being unrecognised, please install vLLM from Github source.
-
- When using vLLM from Python code, pass the `quantization=awq` parameter, for example:
-
- ```python
- from vllm import LLM, SamplingParams
-
- prompts = [
-     "Hello, my name is",
-     "The president of the United States is",
-     "The capital of France is",
-     "The future of AI is",
- ]
- sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
- llm = LLM(model="TheBloke/Mistral-7B-v0.1-AWQ", quantization="awq", dtype="half")
-
- outputs = llm.generate(prompts, sampling_params)
-
- # Print the outputs.
- for output in outputs:
-     prompt = output.prompt
-     generated_text = output.outputs[0].text
-     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
- ```
- <!-- README_AWQ.md-use-from-vllm end -->
 
 <!-- README_AWQ.md-use-from-python start -->
- ## Serving this model from TGI
-
- TGI merged support for AWQ on September 25th, 2023. At the time of writing you need to use the `:latest` Docker container: `ghcr.io/huggingface/text-generation-inference:latest`
-
- Add the parameter `--quantize awq` for AWQ support.
-
- Example parameters:
- ```shell
- --model-id TheBloke/Mistral-7B-v0.1-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
- ```
-
 ## How to use this AWQ model from Python code
 
 ### Install the necessary packages
 
- Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.0.2 or later
-
- ```shell
- pip3 install autoawq
- ```
-
- If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:
+ Requires:
+
+ - Transformers from [commit 72958fcd3c98a7afdc61f953aa58c544ebda2f79](https://github.com/huggingface/transformers/commit/72958fcd3c98a7afdc61f953aa58c544ebda2f79)
+ - [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) from [PR #79](https://github.com/casper-hansen/AutoAWQ/pull/79).
 
 ```shell
- pip3 uninstall -y autoawq
+ pip3 install git+https://github.com/huggingface/transformers.git@72958fcd3c98a7afdc61f953aa58c544ebda2f79
+
 git clone https://github.com/casper-hansen/AutoAWQ
 cd AutoAWQ
+ git checkout mistral
 pip3 install .
 ```
 
@@ -220,10 +169,6 @@ print(pipe(prompt_template)[0]['generated_text'])
 The files provided are tested to work with:
 
 - [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- - [vLLM](https://github.com/vllm-project/vllm)
- - [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
-
- TGI merged AWQ support on September 25th, 2023: [TGI PR #1054](https://github.com/huggingface/text-generation-inference/pull/1054). Use the `:latest` Docker container until the next TGI release is made.
 
 <!-- README_AWQ.md-compatibility end -->
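This commit changes only the description, the install requirements and the compatibility list; the README's Python inference example itself is unchanged and so does not appear in the diff. For orientation, below is a minimal sketch of running these AWQ files once the packages above are installed, assuming AutoAWQ's documented `AutoAWQForCausalLM.from_quantized` / `generate` API. The prompt and sampling settings are illustrative only, and a CUDA GPU is assumed.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/Mistral-7B-v0.1-AWQ"

# Load the quantized weights and the matching tokenizer.
# fuse_layers=True enables AutoAWQ's fused modules for faster inference.
model = AutoAWQForCausalLM.from_quantized(model_name_or_path,
                                          fuse_layers=True,
                                          trust_remote_code=False,
                                          safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

# This is a base model, so the prompt template is simply {prompt}.
prompt = "Tell me about AI"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Sampling parameters here are illustrative, not recommendations.
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512,
)

print(tokenizer.decode(generation_output[0]))
```

If loading fails because the Mistral architecture is not recognised, check that Transformers was installed from the commit listed in the diff above.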
 
 