Commit 04c4b2d (verified) by reach-vb (HF staff) and alvarobartt (HF staff)
1 Parent(s): 51de55c

Update README.md (#5)

- Update README.md (57492d6f08fe62476c6c2cb0469d2e208e87054e)
- Update README.md (59af7ca3372c1acd5330091322bbeca6fd4bd6ce)


Co-authored-by: Alvaro Bartolome <[email protected]>

Files changed (1)
  1. README.md +136 -1
@@ -122,7 +122,142 @@ The AutoGPTQ script has been adapted from [`AutoGPTQ/examples/quantization/basic
122
 
123
  ### 🤗 Text Generation Inference (TGI)
124
 
125
- Coming soon!
126
 
127
  ## Quantization Reproduction
128
 
 
122
 
123
  ### 🤗 Text Generation Inference (TGI)
124
 
125
+ To run the `text-generation-launcher` with Llama 3.1 405B Instruct GPTQ in INT4 with Marlin kernels for optimized inference speed, you need Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and the `huggingface_hub` Python package, which is used to log in to the Hugging Face Hub.
126
+
127
+ ```bash
128
+ pip install -q --upgrade huggingface_hub
129
+ huggingface-cli login
130
+ ```
131
+
132
+ Then run the TGI v2.2.0 (or higher) Docker container as follows:
133
+
134
+ ```bash
135
+ docker run --gpus all --shm-size 1g -ti -p 8080:80 \
136
+ -v hf_cache:/data \
137
+ -e MODEL_ID=hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 \
138
+ -e NUM_SHARD=8 \
139
+ -e QUANTIZE=gptq \
140
+ -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
141
+ -e MAX_INPUT_LENGTH=4000 \
142
+ -e MAX_TOTAL_TOKENS=4096 \
143
+ ghcr.io/huggingface/text-generation-inference:2.2.0
144
+ ```
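The container may need a long time to download and load the weights before it accepts requests. The following is a minimal sketch, not part of the original README, that polls TGI's `/health` route until it answers with HTTP 200. The helper names (`is_healthy`, `wait_until_ready`) and the timeout values are assumptions chosen for illustration.

```python
import time
import urllib.request


def is_healthy(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False


def wait_until_ready(base_url, max_wait_s=3600.0, interval_s=10.0,
                     probe=is_healthy, sleep=time.sleep):
    """Poll `probe` every `interval_s` seconds until it succeeds or time runs out."""
    waited = 0.0
    while waited <= max_wait_s:
        if probe(base_url):
            return True
        sleep(interval_s)
        waited += interval_s
    return False


# Usage (assumes the container above is publishing port 8080):
# wait_until_ready("http://0.0.0.0:8080")
```

The `probe` and `sleep` parameters are injectable only so the loop can be exercised without a running server.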
145
+
146
+ > [!NOTE]
147
+ > TGI exposes several endpoints; to see all of them, check the [TGI OpenAPI Specification](https://huggingface.github.io/text-generation-inference/#/).
148
+
149
+ To send a request to the deployed TGI endpoint compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:
150
+
151
+ ```bash
152
+ curl 0.0.0.0:8080/v1/chat/completions \
153
+ -X POST \
154
+ -H 'Content-Type: application/json' \
155
+ -d '{
156
+ "model": "tgi",
157
+ "messages": [
158
+ {
159
+ "role": "system",
160
+ "content": "You are a helpful assistant."
161
+ },
162
+ {
163
+ "role": "user",
164
+ "content": "What is Deep Learning?"
165
+ }
166
+ ],
167
+ "max_tokens": 128
168
+ }'
169
+ ```
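The `max_tokens` value in a request interacts with the `MAX_INPUT_LENGTH=4000` and `MAX_TOTAL_TOKENS=4096` settings passed to the container. As a rough sketch, assuming TGI rejects a request when the prompt exceeds `MAX_INPUT_LENGTH` or when prompt tokens plus requested new tokens exceed `MAX_TOTAL_TOKENS`, the budget can be checked as below. The function names and the prompt token counts are hypothetical; real counts come from the model's tokenizer.

```python
# Values taken from the `docker run` command in this README.
MAX_INPUT_LENGTH = 4000
MAX_TOTAL_TOKENS = 4096


def fits_budget(prompt_tokens: int, max_new_tokens: int,
                max_input_length: int = MAX_INPUT_LENGTH,
                max_total_tokens: int = MAX_TOTAL_TOKENS) -> bool:
    """Check the two limits assumed above: prompt size and prompt + generation size."""
    return (prompt_tokens <= max_input_length
            and prompt_tokens + max_new_tokens <= max_total_tokens)


def max_new_tokens_allowed(prompt_tokens: int,
                           max_input_length: int = MAX_INPUT_LENGTH,
                           max_total_tokens: int = MAX_TOTAL_TOKENS) -> int:
    """Largest `max_tokens` that still fits, or 0 if the prompt itself is too long."""
    if prompt_tokens > max_input_length:
        return 0
    return max(0, max_total_tokens - prompt_tokens)


# A short prompt (hypothetically 50 tokens) leaves room for max_tokens=128.
print(fits_budget(50, 128))
# A prompt at the 4000-token input limit leaves only 4096 - 4000 tokens.
print(max_new_tokens_allowed(4000))
```

Under these assumptions, a prompt that uses the full input limit can only generate 96 new tokens, so `max_tokens=128` would not fit for such a prompt.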
170
+
171
+ Or programmatically via the `huggingface_hub` Python client as follows:
172
+
173
+ ```python
174
+ import os
175
+ from huggingface_hub import InferenceClient
176
+
177
+ client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))
178
+
179
+ chat_completion = client.chat.completions.create(
180
+ model="hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
181
+ messages=[
182
+ {"role": "system", "content": "You are a helpful assistant."},
183
+ {"role": "user", "content": "What is Deep Learning?"},
184
+ ],
185
+ max_tokens=128,
186
+ )
187
+ ```
188
+
189
+ Alternatively, the OpenAI Python client can be used (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:
190
+
191
+ ```python
192
+ import os
193
+ from openai import OpenAI
194
+
195
+ client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("OPENAI_API_KEY", "-"))
196
+
197
+ chat_completion = client.chat.completions.create(
198
+ model="tgi",
199
+ messages=[
200
+ {"role": "system", "content": "You are a helpful assistant."},
201
+ {"role": "user", "content": "What is Deep Learning?"},
202
+ ],
203
+ max_tokens=128,
204
+ )
205
+ ```
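The snippets above store the result in `chat_completion` without reading it. With either Python client the generated text is available as `chat_completion.choices[0].message.content`. For the raw JSON returned by `curl`, a minimal sketch of the same lookup is shown below; the `sample` payload is a hypothetical, abbreviated response that follows the OpenAI chat completion schema, not captured server output.

```python
import json


def extract_reply(response: dict) -> str:
    """Return the assistant message text from a chat completion response."""
    return response["choices"][0]["message"]["content"]


def completion_tokens(response: dict):
    """Return the number of generated tokens, or None if usage is not reported."""
    return response.get("usage", {}).get("completion_tokens")


# Hypothetical response body, shaped like the OpenAI chat completion schema.
sample = json.loads("""
{
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Deep learning is a subfield of machine learning."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 26, "completion_tokens": 10, "total_tokens": 36}
}
""")

print(extract_reply(sample))
print(completion_tokens(sample))
```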
206
+
207
+ ### vLLM
208
+
209
+ To run vLLM with Llama 3.1 405B Instruct GPTQ in INT4, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and run the latest vLLM Docker container as follows:
210
+
211
+ ```bash
212
+ docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
213
+ -v hf_cache:/root/.cache/huggingface \
214
+ vllm/vllm-openai:latest \
215
+ --model hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 \
216
+ --quantization gptq_marlin \
217
+ --tensor-parallel-size 8 \
218
+ --max-model-len 4096
219
+ ```
220
+
221
+ To send a request to the deployed vLLM endpoint compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:
222
+
223
+ ```bash
224
+ curl 0.0.0.0:8000/v1/chat/completions \
225
+ -X POST \
226
+ -H 'Content-Type: application/json' \
227
+ -d '{
228
+ "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
229
+ "messages": [
230
+ {
231
+ "role": "system",
232
+ "content": "You are a helpful assistant."
233
+ },
234
+ {
235
+ "role": "user",
236
+ "content": "What is Deep Learning?"
237
+ }
238
+ ],
239
+ "max_tokens": 128
240
+ }'
241
+ ```
242
+
243
+ Or programmatically via the `openai` Python client (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:
244
+
245
+ ```python
246
+ import os
247
+ from openai import OpenAI
248
+
249
+ client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))
250
+
251
+ chat_completion = client.chat.completions.create(
252
+ model="hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
253
+ messages=[
254
+ {"role": "system", "content": "You are a helpful assistant."},
255
+ {"role": "user", "content": "What is Deep Learning?"},
256
+ ],
257
+ max_tokens=128,
258
+ )
259
+ ```
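For long generations it can be more convenient to consume tokens as they are produced. The sketch below, which is not part of the original README, wraps the same `client.chat.completions.create` call with `stream=True` and concatenates the streamed deltas. The helper name `stream_reply` and its `on_token` callback are assumptions for illustration; it expects a client object like the `OpenAI` instance created above.

```python
def stream_reply(client, model, messages, max_tokens=128, on_token=None):
    """Stream a chat completion and return the full generated text.

    `client` is expected to be an OpenAI-compatible client (for example the
    `OpenAI` instance pointed at the vLLM server). `on_token` is an optional
    callback invoked with each text fragment as it arrives.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        stream=True,
    )
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if on_token is not None:
                on_token(text)
    return "".join(parts)


# Usage (assumes `client` from the snippet above and a running vLLM server):
# stream_reply(
#     client,
#     "hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
#     [{"role": "user", "content": "What is Deep Learning?"}],
#     on_token=lambda t: print(t, end="", flush=True),
# )
```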
260
+
261
 
262
  ## Quantization Reproduction
263