Model Name: Qwen2 orca_mini_v7_7b-AWQ
orca_mini_v7_7b-AWQ is AWQ Quantize version of orca_mini_v7_7b model.
Passionate about Generative AI? I help companies to privately train and deploy custom LLM/MLLM affordably. For startups, I can even assist with securing GPU grants to get you started. Let's chat!https://www.linkedin.com/in/pankajam Looking forward to connecting!
Example Usage
Here is the ChatML prompt format
<|im_start|>system
You are Orca Mini, a helpful AI assistant.<|im_end|>
<|im_start|>user
Hello Orca Mini, what can you do for me?<|im_end|>
<|im_start|>assistant
Below shows a code example on how to use this model
from transformers import AutoModel, AutoTokenizer
model_slug = "pankajmathur/orca_mini_v7_7b-AWQ"
model = AutoModel.from_pretrained(model_slug)
tokenizer = AutoTokenizer.from_pretrained(model_slug)
messages = [
{"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
{"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
]
gen_input = tokenizer.apply_chat_template(messages, return_tensors="pt")
model.generate(**gen_input)
Processing Long Texts (Based upon Qwen2-7B-Instruct suggestions at https://huggingface.co/Qwen/Qwen2-7B-Instruct)
To handle extensive inputs exceeding 32,768 tokens, we utilize YARN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:
- Install vLLM: You can install vLLM by running the following command.
pip install "vllm>=0.4.3"
Or you can install vLLM from source.
Configure Model Settings: After downloading the model weights, modify the
config.json
file by including the below snippet:{ "architectures": [ "Qwen2ForCausalLM" ], // ... "vocab_size": 152064, // adding the following snippets "rope_scaling": { "factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn" } }
This snippet enable YARN to support longer contexts.
Model Deployment: Utilize vLLM to deploy your model. For instance, you can set up an openAI-like server using the command:
python -u -m vllm.entrypoints.openai.api_server --model pankajmathur/orca_mini_v7_7b-AWQ --quantization awq
Then you can access the Chat API by:
curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "pankajmathur/orca_mini_v7_7b-AWQ", "messages": [ {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."}, {"role": "user", "content": "Hello Orca Mini, what can you do for me?"} ] }'
Note: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling
configuration only when processing long contexts is required.
- Downloads last month
- 9