This version of Qwen2.5-0.5B-Instruct-GPTQ-Int4 has been converted to run on the Axera NPU using w4a16 quantization.
Compatible with Pulsar2 version: 3.4 (not yet released).
For those interested in model conversion, you can export the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4

See the Pulsar2 documentation (Pulsar2 Link) and the guide "How to Convert LLM from Huggingface to axmodel".
| Chips | w8a16 | w4a16 |
|---|---|---|
| AX650 | 28 tokens/sec | 44 tokens/sec |
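As a back-of-envelope check, total response latency is roughly ttft plus the number of generated tokens divided by the decode rate. A small sketch using the w4a16 figure from the table and the ttft measured in the run further below (the response length is illustrative):

```python
# Rough latency estimate for the AX650 w4a16 numbers above.
# ttft and throughput come from this card's demo logs; token count is made up.
ttft_s = 0.13466       # time-to-first-token from the ax650 run (134.66 ms)
decode_tps = 44.0      # w4a16 decode throughput from the table
new_tokens = 100       # hypothetical response length

total_s = ttft_s + new_tokens / decode_tps
print(f"~{total_s:.2f}s for {new_tokens} tokens "
      f"({new_tokens / total_s:.1f} tok/s effective)")
```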
Download all files from this repository to the device
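For example, a minimal sketch with `huggingface_hub` (the repo id below is assumed from this card's name; replace it if it differs):

```python
from huggingface_hub import snapshot_download

# Assumed repo id for this model card; adjust if the actual id differs.
snapshot_download(
    repo_id="AXERA-TECH/Qwen2.5-0.5B-Instruct-GPTQ-Int4",
    local_dir="qwen2.5-0.5b",  # matches the directory used in the session below
)
```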
root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b# tree -L 1
.
├── qwen2.5-0.5b-gptq-int4-ax650
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer.py
├── main_axcl_aarch64
├── main_axcl_x86
├── main_prefill
├── post_config.json
├── run_qwen2.5_0.5b_gptq_int4_ax650.sh
├── run_qwen2.5_0.5b_gptq_int4_axcl_aarch64.sh
└── run_qwen2.5_0.5b_gptq_int4_axcl_x86.sh
root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b# python3 qwen2.5_tokenizer.py --port 12345
None None 151645 <|im_end|>
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
hello world<|im_end|>
<|im_start|>assistant
[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14990, 1879, 151645, 198, 151644, 77091, 198]
http://localhost:12345
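The ID sequence printed above is simply Qwen2.5's chat template applied to the system and user messages. A minimal sketch that reproduces it with the stock Hugging Face tokenizer (assuming `transformers` is installed; the bundled qwen2.5_tokenizer.py serves the same tokenizer over HTTP at the URL above, though its exact endpoint API is not documented here):

```python
from transformers import AutoTokenizer

# Stock Qwen2.5 tokenizer; the bundled qwen2.5_tokenizer directory
# should contain equivalent files.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system",
     "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "hello world"},
]

# apply_chat_template adds the <|im_start|>/<|im_end|> markers and the
# assistant prompt, matching the IDs printed by qwen2.5_tokenizer.py.
ids = tok.apply_chat_template(messages, add_generation_prompt=True)
print(ids)                               # [151644, 8948, 198, ..., 151644, 77091, 198]
print(tok.eos_token_id, tok.eos_token)   # 151645 <|im_end|>
```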
Open another terminal and run run_qwen2.5_0.5b_gptq_int4_ax650.sh:
root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b# ./run_qwen2.5_0.5b_gptq_int4_ax650.sh
[I][ Init][ 125]: LLM init start
bos_id: -1, eos_id: 151645
3% | ██ | 1 / 27 [0.00s<0.08s, 333.33 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 27 / 27 [1.34s<1.34s, 20.10 count/s] init post axmodel ok,remain_cmm(3427 MB)
[I][ Init][ 241]: max_token_len : 1024
[I][ Init][ 246]: kv_cache_size : 128, kv_cache_num: 1024
[I][ Init][ 254]: prefill_token_num : 128
[I][ load_config][ 281]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 268]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you
[I][ Run][ 466]: ttft: 134.66 ms
I am Qwen, a Qwen AI created by Alibaba Cloud. I am here to assist you with various topics and provide help to the best of my ability. I am here to help
with any questions you have about science, technology, or any other topic you might have for help or guidance. I am always happy to help you!
[N][ Run][ 605]: hit eos,avg 42.11 token/s
>> 1+1=?
[I][ Run][ 466]: ttft: 135.07 ms
1+1=2
[N][ Run][ 605]: hit eos,avg 43.04 token/s
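The post_config.json loaded above enables temperature scaling and top-k sampling (temperature 0.9, k = 10), while repetition penalty and top-p are disabled. A minimal sketch of that decoding step under those assumed semantics (this is not the runtime's actual implementation):

```python
import numpy as np

def sample_next_token(logits: np.ndarray,
                      temperature: float = 0.9,
                      top_k: int = 10) -> int:
    """Temperature + top-k sampling as configured in post_config.json (assumed semantics)."""
    logits = logits / temperature                    # enable_temperature
    top = np.argpartition(logits, -top_k)[-top_k:]   # enable_top_k_sampling: keep k best
    probs = np.exp(logits[top] - logits[top].max())  # stable softmax over the k candidates
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs))

# Toy usage with a fake vocabulary of 32 tokens.
fake_logits = np.random.randn(32).astype(np.float32)
print(sample_next_token(fake_logits))
```

Since enable_top_p_sampling is false, the top_p value of 0.8 is ignored; enabling it would additionally truncate the candidate set to the smallest set of tokens whose cumulative probability exceeds 0.8.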
What is the M.2 accelerator card? The following demo runs the same model on an AX650N-based M.2 accelerator card, with a Raspberry Pi 5 as the host.
(base) axera@raspberrypi:~/samples/qwen2.5-0.5b $ ./run_qwen2.5_0.5b_gptq_int4_axcl_aarch64.sh
build time: Feb 13 2025 15:44:57
[I][ Init][ 111]: LLM init start
bos_id: -1, eos_id: 151645
100% | ████████████████████████████████ | 27 / 27 [11.64s<11.64s, 2.32 count/s] init post axmodel ok,remain_cmm(6788 MB)
[I][ Init][ 226]: max_token_len : 1024
[I][ Init][ 231]: kv_cache_size : 128, kv_cache_num: 1024
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 288]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you
I am Qwen, a Qwen-like language model created by Alibaba Cloud. I am designed to assist users in answering questions, generating text,
and participating in conversations. I am here to help you with your questions and to engage in meaningful exchanges with you.
If you have any questions, you can ask me, and if you want, you can even write to me!
[N][ Run][ 610]: hit eos,avg 25.88 token/s
>> 1+1=?
1+1=2
[N][ Run][ 610]: hit eos,avg 29.73 token/s
>> q
(base) axera@raspberrypi:~/samples/qwen2.5-0.5b $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI V2.26.0_20250205130139 Driver V2.26.0_20250205130139 |
+-----------------------------------------+--------------+---------------------------------------+
| Card Name Firmware | Bus-Id | Memory-Usage |
| Fan Temp Pwr:Usage/Cap | CPU NPU | CMM-Usage |
|=========================================+==============+=======================================|
| 0 AX650N V2.26.0 | 0000:01:00.0 | 170 MiB / 945 MiB |
| -- 43C -- / -- | 2% 0% | 392 MiB / 7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+
+------------------------------------------------------------------------------------------------+
| Processes: |
| Card PID Process Name NPU Memory Usage |
|================================================================================================|
| 0 474440 /home/axera/samples/qwen2.5-0.5b-gptq-int4/main_axcl_aarch64 370172 KiB |
+------------------------------------------------------------------------------------------------+
(base) axera@raspberrypi:~ $
Base model: Qwen/Qwen2.5-0.5B