feihu.hf committed
Commit 54e5483
1 Parent(s): f346172

update README.md

Files changed (1): README.md (+17, -17)
README.md CHANGED
@@ -79,22 +79,22 @@ For deployment, we recommend using vLLM. You can enable long-context capabilitie
  1. **Install vLLM**: Ensure you have the latest version from the main branch of [vLLM](https://github.com/vllm-project/vllm).
 
  2. **Configure Model Settings**: After downloading the model weights, modify the `config.json` file by including the below snippet:
- ```json5
+ ```json
  {
- "architectures": [
- "Qwen2ForCausalLM"
- ],
- // ...
- "vocab_size": 152064,
-
- // adding the following snippets
- "rope_scaling": {
- "factor": 4.0,
- "original_max_position_embeddings": 32768,
- "type": "yarn"
+ "architectures": [
+ "Qwen2ForCausalLM"
+ ],
+ // ...
+ "vocab_size": 152064,
+
+ // adding the following snippets
+ "rope_scaling": {
+ "factor": 4.0,
+ "original_max_position_embeddings": 32768,
+ "type": "yarn"
+ }
  }
- }
- ```
+ ```
  This snippet enable YARN to support longer contexts.
 
  3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an openAI-like server using the command:
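An aside on step 2 above (the launch command referenced in step 3 lies outside this hunk and continues in the next one): strict JSON does not accept the `// ...` comments shown in the snippet; they only mark where the new keys belong. Below is a minimal sketch of merging the `rope_scaling` block into a downloaded checkpoint's `config.json`; the checkpoint path and the use of `jq` are illustrative assumptions, not part of the README.

```bash
# Sketch only: add the YARN rope_scaling block to an existing config.json.
# The path and the jq dependency are assumptions; adjust to your checkpoint.
CONFIG=./Qwen2-72B-Instruct/config.json

jq '. + {
      "rope_scaling": {
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn"
      }
    }' "$CONFIG" > config.json.tmp && mv config.json.tmp "$CONFIG"
```

Any JSON-aware tool works equally well; the only requirement is that `rope_scaling` ends up as a top-level key in `config.json`.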
@@ -111,15 +111,15 @@ For deployment, we recommend using vLLM. You can enable long-context capabilitie
  -d '{
  "model": "Qwen2-72B-Instruct",
  "messages": [
- {"role": "system", "content": "You are a helpful assistant."},
- {"role": "user", "content": "Your Long Input Here."}
+ {"role": "system", "content": "You are a helpful assistant."},
+ {"role": "user", "content": "Your Long Input Here."}
  ]
  }'
  ```
 
  For further usage instructions of vLLM, please refer to our [Github](https://github.com/QwenLM/Qwen2).
 
- **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
+ **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
 
  ## Citation
 
 
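Since the hunk above only captures the tail of the request body, here is a minimal, self-contained sketch of step 3 against vLLM's OpenAI-compatible server; the model path, served model name, and port are assumptions rather than text from the README.

```bash
# Launch an OpenAI-compatible server (model path, served name, and port are assumptions).
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-72B-Instruct \
    --served-model-name Qwen2-72B-Instruct \
    --port 8000

# From another shell, send a chat completion request with a long input.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen2-72B-Instruct",
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Your Long Input Here."}
          ]
        }'
```

With the step-2 `rope_scaling` entry in place (factor 4.0 over 32,768 original positions), the server can in principle accept contexts up to roughly 4 × 32,768 = 131,072 tokens, subject to the static-YARN caveat in the note above.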