All previous versions are accessible through branches.

- **V1.0**: Trained on 420K chat data.
- **V2.0**: Trained on 520K data. Check out our [release blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) for more details.
## Prompt Template

We utilized prefix tokens `<user>:` and `<assistant>:` to distinguish between user and assistant utterances.

You can apply this formatting using the [chat template](https://huggingface.co/docs/transformers/main/chat_templating) through the `apply_chat_template()` method.
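For illustration, here is a minimal sketch of that call; the example messages are our own, while the template itself ships with this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bofenghuang/vigogne-2-7b-chat")

# Illustrative conversation history.
conversation = [
    {"role": "user", "content": "Bonjour ! Comment ça va ?"},
    {"role": "assistant", "content": "Très bien, merci ! Comment puis-je vous aider ?"},
    {"role": "user", "content": "Parle-moi de toi-même."},
]

# Render the conversation into a single prompt string, appending the
# prefix for the assistant's next turn.
prompt = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
print(prompt)
```

With this template, the rendered string should end with the `<assistant>:` prefix, ready for generation.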
## Usage

### Inference using the quantized versions

The quantized versions of this model are generously provided by [TheBloke](https://huggingface.co/TheBloke)!
- AWQ for GPU inference: [TheBloke/Vigogne-2-7B-Chat-AWQ](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-AWQ)
- GPTQ for GPU inference: [TheBloke/Vigogne-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GPTQ)
- GGUF for CPU+GPU inference: [TheBloke/Vigogne-2-7B-Chat-GGUF](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GGUF)

These versions facilitate testing and development with various popular frameworks, including [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [vLLM](https://github.com/vllm-project/vllm), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [llama.cpp](https://github.com/ggerganov/llama.cpp), [text-generation-webui](https://github.com/oobabooga/text-generation-webui), and more.
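As one example, the AWQ checkpoint can be loaded with AutoAWQ roughly as follows. This is a minimal sketch, not an officially tested recipe; the sampling settings are illustrative:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Vigogne-2-7B-Chat-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True, safetensors=True)

# Format a single-turn conversation with the chat template.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Parle-moi de toi-même."}],
    tokenize=False,
    add_generation_prompt=True,
)

tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
# Illustrative sampling settings.
output = model.generate(tokens, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```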
### Inference using the unquantized model with 🤗 Transformers

```python
from typing import Dict, List, Optional
import torch

# ... (model loading and the definition of the chat() helper omitted) ...

response, history = chat("Quand il peut dépasser le lapin ?", history=history)
response, history = chat("Écris une histoire imaginative qui met en scène une compétition de course entre un escargot et un lapin.", history=history)
```
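For a single-turn request without the helper, a minimal sketch using standard 🤗 Transformers APIs might look like this (generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bofenghuang/vigogne-2-7b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Build the prompt from the chat template and tokenize it directly.
conversation = [{"role": "user", "content": "Parle-moi de toi-même."}]
input_ids = tokenizer.apply_chat_template(
    conversation, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Illustrative sampling settings; tune for your use case.
output = model.generate(
    input_ids, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9
)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```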
You can also use the Google Colab Notebook provided below.

<a href="https://colab.research.google.com/github/bofenghuang/vigogne/blob/main/notebooks/infer_chat.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
### Inference using the unquantized model with vLLM

Set up an OpenAI-compatible server with the following command:
```bash
# Install vLLM
# This may take 5-10 minutes.
# pip install vllm

# Start server for Vigogne-Chat models
python -m vllm.entrypoints.openai.api_server --model bofenghuang/vigogne-2-7b-chat

# List models
# curl http://localhost:8000/v1/models
```
Query the model using the `openai` Python package.
```python
import openai

# Modify OpenAI's API key and API base to use vLLM's API server.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

# Query the server's model list and take the first model
models = openai.Model.list()
model = models["data"][0]["id"]

# Chat completion API
chat_completion = openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "user", "content": "Parle-moi de toi-même."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
print("Chat completion results:", chat_completion)
```
## Limitations

Vigogne is still under development, and there are many limitations that have to be addressed. Please note that the model may generate harmful or biased content, incorrect information, or generally unhelpful answers.