Imran1 committed on
Commit
75dfc1d
1 Parent(s): 8a2541a

Update README.md

Files changed (1)
  1. README.md +66 -3
README.md CHANGED
@@ -1,3 +1,66 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+
+ # Imran1/Qwen2.5-72B-Instruct-FP8
+
+ ## Overview
+ **Imran1/Qwen2.5-72B-Instruct-FP8** is an FP8-quantized (8-bit floating point) version of the base model **Qwen2.5-72B-Instruct**. Quantizing to FP8 reduces memory usage and improves computational efficiency, which suits large-scale inference workloads, at the cost of a possible slight loss in precision on some tasks.
+
+ This model is well suited to applications such as:
+ - Conversational AI and chatbots
+ - Instruction-following tasks
+ - Text generation, summarization, and dialogue completion
+
+ ## Key Features
+ - **72 billion parameters** for strong language generation and understanding.
+ - **FP8 precision** for reduced memory consumption and faster inference.
+ - Supports **tensor parallelism** for sharding the model across multiple GPUs.
+
+ ## Usage Instructions
+
+ ### 1. Running the Model with vLLM
+ You can serve the model using **vLLM** with tensor parallelism enabled. The example command below shards the model across two GPUs and protects the server with an API key:
+
+ ```bash
+ vllm serve Imran1/Qwen2.5-72B-Instruct-FP8 --api-key token-abc123 --tensor-parallel-size 2
+ ```
+
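If you launch the server from a script instead of typing the command by hand, a small helper can assemble the same arguments. The following is a minimal sketch, not part of the original README: the function name `build_serve_command` and the environment variable name `VLLM_API_KEY` are choices made for this example, and only the flags shown in the command above are used.

```python
import os
import shlex


def build_serve_command(model, tensor_parallel_size=2, api_key=None):
    """Return the `vllm serve` argument list used in the README example.

    Hypothetical helper: it only assembles arguments and does not start vLLM.
    """
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if api_key:
        # Only add the flag when a key is actually provided.
        cmd += ["--api-key", api_key]
    return cmd


if __name__ == "__main__":
    # Reading the key from the environment keeps it out of the script itself.
    key = os.environ.get("VLLM_API_KEY")
    command = build_serve_command("Imran1/Qwen2.5-72B-Instruct-FP8", 2, key)
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with the key.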
+ ### 2. Interacting with the Model via Python (OpenAI API)
+ The vLLM server exposes an OpenAI-compatible API, so you can query it with the `openai` Python client:
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(
+     base_url="http://localhost:8000/v1",  # Your vLLM server URL
+     api_key="token-abc123",  # Replace with your API key (must match --api-key)
+ )
+
+ # Example chat completion request, streamed chunk by chunk
+ completion = client.chat.completions.create(
+     model="Imran1/Qwen2.5-72B-Instruct-FP8",
+     messages=[
+         {"role": "user", "content": "Hello!"},
+     ],
+     max_tokens=500,
+     stream=True,
+ )
+
+ # With stream=True the call returns an iterator of chunks, not a single response
+ for chunk in completion:
+     if chunk.choices and chunk.choices[0].delta.content:
+         print(chunk.choices[0].delta.content, end="", flush=True)
+ print()
+ ```
+
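When a request is made with `stream=True`, the client yields a sequence of chunks whose `choices[0].delta.content` fields carry the text fragments. If you want the full reply as a single string, one option is to collect those fragments. The sketch below is an illustration rather than part of the original README: `collect_stream_text` and `_fake_chunk` are hypothetical names, and the fake chunks only mimic the attribute layout of real streamed chunks so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream_text(chunks):
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # the content of a delta can be None or empty
            parts.append(delta)
    return "".join(parts)


def _fake_chunk(text):
    # Stand-in object with the same attribute path as a streamed chunk.
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo_stream = [_fake_chunk("Hel"), _fake_chunk("lo"), _fake_chunk(None)]
print(collect_stream_text(demo_stream))
```

With a live server, the object returned by `client.chat.completions.create(..., stream=True)` can be passed in place of `demo_stream`.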
+ ## Performance and Efficiency
+ - **Memory Efficiency**: FP8 stores each weight in 8 bits, roughly halving weight memory compared with FP16, which leaves room for larger batch sizes.
+ - **Speed**: The FP8 version can provide faster inference, making it a good fit for latency-sensitive and real-time applications.
+
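To make the memory claim concrete, here is a back-of-the-envelope estimate of weight memory. It is a rough sketch under stated assumptions: it uses the nominal 72 billion parameter count from the Key Features list, treats every weight as stored at the given width, assumes weights are split evenly across tensor-parallel ranks, and ignores the KV cache, activations, and runtime overhead. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param, tensor_parallel_size=1):
    """Approximate weight memory per GPU in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 1e9


NUM_PARAMS = 72_000_000_000  # nominal parameter count, not an exact figure

fp16_total = weight_memory_gb(NUM_PARAMS, 2)      # 2 bytes per weight
fp8_total = weight_memory_gb(NUM_PARAMS, 1)       # 1 byte per weight
fp8_per_gpu = weight_memory_gb(NUM_PARAMS, 1, 2)  # --tensor-parallel-size 2

print(f"FP16 weights: ~{fp16_total:.0f} GB")
print(f"FP8 weights:  ~{fp8_total:.0f} GB")
print(f"FP8 weights per GPU with 2-way tensor parallelism: ~{fp8_per_gpu:.0f} GB")
```

Actual usage will be higher than these figures, since serving also needs memory for the KV cache and activations.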
+ ## Limitations
+ - **Precision Trade-offs**: While FP8 improves speed and memory usage, tasks that require high precision (e.g., numerical calculations) may see a slight drop in accuracy compared to FP16/FP32 versions.
+
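The precision trade-off can be illustrated by rounding a few numbers to an 8-bit float grid. The sketch below assumes the E4M3 variant of FP8 (4 exponent bits, 3 mantissa bits, largest finite value 448); the README does not state which FP8 format this checkpoint uses, so treat that as an assumption. It also simplifies the format by ignoring subnormal values, and `round_to_e4m3` is a hypothetical helper, not a library function.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the assumed E4M3 format


def round_to_e4m3(x):
    """Round x to the nearest value with 4 significant bits, clamped to +/-448."""
    if x == 0.0:
        return 0.0
    # x == mantissa * 2**exponent with 0.5 <= |mantissa| < 1
    mantissa, exponent = math.frexp(x)
    # Keep 4 significant bits: 1 implicit bit plus 3 stored mantissa bits.
    rounded = round(mantissa * 16) / 16
    value = math.ldexp(rounded, exponent)
    # Saturate instead of overflowing past the format's range.
    return max(-E4M3_MAX, min(E4M3_MAX, value))


for x in (1.0, 1.1, 0.3, 100.0):
    q = round_to_e4m3(x)
    print(f"{x:>6} -> {q:<8} relative error {abs(q - x) / x:.2%}")
```

With only three stored mantissa bits, the rounding error of a single value can reach a few percent, which is why numerically sensitive tasks are the ones most likely to notice the difference.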
+ ## License
+ This model is licensed under the [Apache-2.0](LICENSE) license. You may use it for both commercial and non-commercial purposes, provided you comply with the license terms.
+
+ ---
+
+ For more details and updates, visit the [model page on Hugging Face](https://huggingface.co/Imran1/Qwen2.5-72B-Instruct-FP8).