Qwen3-4B-Instruct-2507-GPTQ-Int4

This is Qwen3-4B-Instruct-2507 quantized to GPTQ Int4 with GPTQModel and converted to run on the Axera NPU using w4a16 quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: 5.1

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507

Pulsar2 link: How to Convert an LLM from Hugging Face to axmodel

AXera NPU LLM Runtime

Supported Platforms

| Chips | w8a16 | w4a16          |
|-------|-------|----------------|
| AX650 | TBD   | 5.7 tokens/sec |
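
As a rough guide to what the w4a16 figure for AX650 (5.7 tokens/sec) means in practice, the sketch below estimates decode time for a given reply length. The function name and the example token counts are illustrative assumptions, and the estimate ignores prefill time (time to first token), which the runtime reports separately.

```python
def estimate_decode_seconds(num_tokens: int, tokens_per_sec: float = 5.7) -> float:
    """Estimate decode time for num_tokens at a steady generation rate.

    The default rate is the w4a16 figure listed for AX650; real throughput
    may vary with context length and system load (assumption).
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    # Example reply lengths chosen for illustration only.
    for n in (64, 256, 1024):
        print(f"{n:5d} tokens ~ {estimate_decode_seconds(n):6.1f} s")
```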

How to use

Download all files from this repository to the device

(dev_env) root@ax650:~/axera/Qwen3-4B-Instruct-2507-GPTQ-Int4# tree -L 1
.
|-- Qwen3-4B-Instruct-2507-GPTQ-Int4-context-4k-prefill-3584
|-- README.md
|-- config.json
|-- main_api_ax650
|-- main_api_axcl_aarch64
|-- main_api_axcl_x86
|-- main_ax650
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- post_config.json
|-- qwen2.5_tokenizer
|-- qwen3_tokenizer
|-- qwen3_tokenizer_uid.py
|-- run_qwen3_4b_int4_gptq_ax650.sh
|-- run_qwen3_4b_int4_gptq_ax650_api.sh
|-- run_qwen3_4b_int4_gptq_axcl_aarch64.sh
|-- run_qwen3_4b_int4_gptq_axcl_x86.sh
`-- run_qwen3_4b_int4_gptq_axcl_x86_api.sh

3 directories, 15 files
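
Before starting anything, it can help to confirm that the download is complete. The following is a minimal sketch that checks for a subset of the entries in the listing above. Which entries are strictly required for an on-board AX650 run is an assumption here, and the helper name `missing_entries` is hypothetical.

```python
from pathlib import Path
from typing import List, Sequence

# Names copied from the `tree -L 1` listing. Treating this subset as the
# minimum needed for an AX650 on-board run is an assumption.
EXPECTED_ENTRIES = (
    "Qwen3-4B-Instruct-2507-GPTQ-Int4-context-4k-prefill-3584",
    "config.json",
    "post_config.json",
    "main_ax650",
    "qwen3_tokenizer",
    "qwen3_tokenizer_uid.py",
    "run_qwen3_4b_int4_gptq_ax650.sh",
)


def missing_entries(repo_dir: str, expected: Sequence[str] = EXPECTED_ENTRIES) -> List[str]:
    """Return the expected files or directories that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing entries:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All expected entries are present.")
```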

Start the Tokenizer service

Install the requirements

pip install transformers jinja2
(dev_env) root@ax650:~/axera/Qwen3-4B-Instruct-2507-GPTQ-Int4# python3 qwen3_tokenizer_uid.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
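
The runtime connects to the tokenizer service at startup, so it is worth confirming the port is reachable before launching the run script. The sketch below only tests that a TCP connection to port 12345 can be opened. It makes no assumption about the service's HTTP routes, and the function name is hypothetical.

```python
import socket


def tokenizer_service_is_up(host: str = "127.0.0.1", port: int = 12345,
                            timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if tokenizer_service_is_up():
        print("Tokenizer service is reachable on port 12345.")
    else:
        print("Tokenizer service is not reachable; start qwen3_tokenizer_uid.py first.")
```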

Inference on an AX650 host, such as the M4N-Dock (AXera-Pi Pro) or the AX650N DEMO Board

Open another terminal and run run_qwen3_4b_int4_gptq_ax650.sh

(dev_env) root@ax650:~/axera/Qwen3-4B-Instruct-2507-GPTQ-Int4# ./run_qwen3_4b_int4_gptq_ax650.sh
[I][                            Init][ 110]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][  57]: uid: 7da8d04b-3172-49bb-9ae2-e71b266f86a8
bos_id: -1, eos_id: 151645
  2% | █                                 |   1 /  39 [3.62s<141.14s, 0.28 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  39 /  39 [10.59s<10.59s, 3.68 count/s] init post axmodel ok,remain_cmm(5306 MB)[I][                            Init][ 188]: max_token_len : 4095
[I][                            Init][ 193]: kv_cache_size : 1024, kv_cache_num: 4095
[I][                            Init][ 201]: prefill_token_num : 256
[I][                            Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 205]: grp: 2, prefill_max_token_num : 256
[I][                            Init][ 205]: grp: 3, prefill_max_token_num : 512
[I][                            Init][ 205]: grp: 4, prefill_max_token_num : 768
[I][                            Init][ 205]: grp: 5, prefill_max_token_num : 1024
[I][                            Init][ 205]: grp: 6, prefill_max_token_num : 1280
[I][                            Init][ 205]: grp: 7, prefill_max_token_num : 1536
[I][                            Init][ 205]: grp: 8, prefill_max_token_num : 1792
[I][                            Init][ 205]: grp: 9, prefill_max_token_num : 2048
[I][                            Init][ 205]: grp: 10, prefill_max_token_num : 2304
[I][                            Init][ 205]: grp: 11, prefill_max_token_num : 2560
[I][                            Init][ 205]: grp: 12, prefill_max_token_num : 2816
[I][                            Init][ 205]: grp: 13, prefill_max_token_num : 3072
[I][                            Init][ 205]: grp: 14, prefill_max_token_num : 3328
[I][                            Init][ 205]: grp: 15, prefill_max_token_num : 3584
[I][                            Init][ 209]: prefill_max_token_num : 3584
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 1,
    "top_p": 0.8
}

[I][                            Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][          GenerateKVCachePrefill][ 280]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][          GenerateKVCachePrefill][ 320]: input_num_token:21
[I][                            main][ 228]: precompute_len: 21
[I][                            main][ 229]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> 你是谁
[I][                      SetKVCache][ 467]: prefill_grpid:2 kv_cache_num:256 precompute_len:21 input_num_token:13
[I][                      SetKVCache][ 470]: current prefill_max_token_num:3328
[I][                             Run][ 596]: input token num : 13, prefill_split_num : 1
[I][                             Run][ 622]: input_num_token:13
[I][                             Run][ 745]: ttft: 2673.40 ms
你好,我是Qwen,由阿里 Cloud研发的超大规模语言模型。我能够回答问题、创作文字,比如写故事、写公文、写邮件、写剧本、逻辑推理、编程等等,还能表达观点,玩游戏。如果你有任何需要,都可以告诉我,我会尽力提供帮助。😊

[N][                             Run][ 864]: hit eos,avg 5.69 token/s

[I][                      GetKVCache][ 436]: precompute_len:97, remaining:3487
prompt >> 1+1=
[I][                      SetKVCache][ 467]: prefill_grpid:2 kv_cache_num:256 precompute_len:97 input_num_token:15
[I][                      SetKVCache][ 470]: current prefill_max_token_num:3328
[I][                             Run][ 596]: input token num : 15, prefill_split_num : 1
[I][                             Run][ 622]: input_num_token:15
[I][                             Run][ 745]: ttft: 2672.85 ms
1 + 1 = 2。 😊

[N][                             Run][ 864]: hit eos,avg 5.69 token/s

[I][                      GetKVCache][ 436]: precompute_len:122, remaining:3462
prompt >> q
(dev_env) root@ax650:~/axera/Qwen3-4B-Instruct-2507-GPTQ-Int4# 
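
To track performance across runs, the figures in the session log above can be pulled out with a few regular expressions. This is a sketch that assumes the log line formats match the sample output exactly; the function and key names are illustrative. In the sample, `precompute_len` plus `remaining` equals 3584 on both `GetKVCache` lines, which matches the reported `prefill_max_token_num`.

```python
import re

# Patterns mirror the sample log lines; other runtime versions may differ.
TTFT_RE = re.compile(r"ttft:\s*([\d.]+)\s*ms")
RATE_RE = re.compile(r"avg\s+([\d.]+)\s+token/s")
CACHE_RE = re.compile(r"precompute_len:(\d+),\s*remaining:(\d+)")


def summarize_log(text: str) -> dict:
    """Extract time-to-first-token, decode rate, and KV cache usage from a log."""
    return {
        "ttft_ms": [float(v) for v in TTFT_RE.findall(text)],
        "tokens_per_sec": [float(v) for v in RATE_RE.findall(text)],
        # Each tuple is (precompute_len, remaining) from a GetKVCache line.
        "kv_cache": [(int(a), int(b)) for a, b in CACHE_RE.findall(text)],
    }


if __name__ == "__main__":
    import sys

    # Usage: ./run_qwen3_4b_int4_gptq_ax650.sh 2>&1 | tee run.log, then pass run.log here.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        print(summarize_log(f.read()))
```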