Qwen3-1.7B

This version of Qwen3-1.7B has been converted to run on the Axera NPU using w8a16 quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: 5.2

Conversion tool links:

If you are interested in model conversion, you can export the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen3-1.7B

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU LLM Runtime

Convert the original Hugging Face Qwen3-1.7B checkpoint to axmodel with w8a16 quantization to produce the final axmodel for the axllm runtime.

export FLOAT_MATMUL_USE_CONV_EU=1 # AX650 only; set this environment variable before running the conversion command for better performance.

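# context window size 2048 (--kv_cache_len); prefill runs in 128-token chunks (--prefill_len), with history lengths up to 1024 (--last_kv_cache_len)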
pulsar2 llm_build --input_path Qwen3-1.7B --output_path <your path> \
--hidden_state_type bf16 --kv_cache_len 2048 --prefill_len 128 --chip AX650 -c 1 --parallel 32 \
--last_kv_cache_len 128 --last_kv_cache_len 256 --last_kv_cache_len 384 --last_kv_cache_len 512 \
--last_kv_cache_len 640 --last_kv_cache_len 768 --last_kv_cache_len 896 --last_kv_cache_len 1024 -w s8
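
The relationship between the flags above can be sketched in a few lines of Python. This is an illustrative calculation, not part of Pulsar2: the helper name `prefill_groups` is hypothetical, and the rule that each prefill group's total capacity equals its history length plus `--prefill_len` is an assumption inferred from the runtime log in this README, which reports `prefill_max_token_num : 1152`.

```python
def prefill_groups(prefill_len, last_kv_cache_lens):
    """Return (history_cap, total_cap) pairs, one per prefill group.

    Assumption: group 0 has no history, and each --last_kv_cache_len value
    adds one group whose total capacity is history + prefill_len.
    """
    histories = [0] + sorted(last_kv_cache_lens)
    return [(h, h + prefill_len) for h in histories]


# Values taken from the pulsar2 llm_build command above.
PREFILL_LEN = 128
LAST_KV_CACHE_LENS = [128, 256, 384, 512, 640, 768, 896, 1024]

groups = prefill_groups(PREFILL_LEN, LAST_KV_CACHE_LENS)
max_prefill_tokens = groups[-1][1]

for history_cap, total_cap in groups:
    print(f"history_cap={history_cap:4d} total_cap={total_cap:4d}")
print("max prefill tokens:", max_prefill_tokens)
```

Under this assumption the eight `--last_kv_cache_len` values yield nine prefill groups, and the largest prompt that can be prefilled is 1024 + 128 = 1152 tokens.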

Supported Platforms

AX650
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card

Chip	w8a16 decode speed	CMM	Flash
AX650	8.6 tokens/sec	2.4 GiB	2.6 GiB

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

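Option 3: download the executable built by GitHub Actions CI (for users without a build environment):

If you do not have a build environment, go to https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm and download the latest CI-built executable (axllm), then: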

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Download the model (Hugging Face)

mkdir -p AXERA-TECH/Qwen3-1.7B
cd AXERA-TECH/Qwen3-1.7B
hf download AXERA-TECH/Qwen3-1.7B --local-dir .

# structure of the downloaded files
tree -L 3
.
└── AXERA-TECH
    └── Qwen3-1.7B
        ├── README.md
        ├── config.json
        ├── model.embed_tokens.weight.bfloat16.bin
        ├── post_config.json
        ├── qwen3_p128_l0_together.axmodel
...
        ├── qwen3_p128_l9_together.axmodel
        ├── qwen3_post.axmodel
        └── qwen3_tokenizer.txt

2 directories, 34 files
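
To confirm that a download is complete before starting the runtime, a small check along these lines may help. It is a sketch, not part of axllm: the function name `check_model_dir` is hypothetical, the file names are copied from the listing above, and the per-layer files are only counted because the listing elides their exact number.

```python
from pathlib import Path

# File names taken from the directory listing above.
REQUIRED_FILES = [
    "config.json",
    "post_config.json",
    "model.embed_tokens.weight.bfloat16.bin",
    "qwen3_post.axmodel",
    "qwen3_tokenizer.txt",
]
# Per-layer model files; the listing elides how many there are, so we only count them.
LAYER_PATTERN = "qwen3_p128_l*_together.axmodel"


def check_model_dir(model_dir):
    """Return (missing_files, layer_file_count) for a downloaded model directory."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    layer_count = len(list(root.glob(LAYER_PATTERN)))
    return missing, layer_count
```

For example, `missing, layers = check_model_dir("AXERA-TECH/Qwen3-1.7B")`; an empty `missing` list together with a non-zero `layers` count suggests the expected files are in place.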

Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N demo board

Run (CLI)

(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-1.7B/
14:56:46.670 INF Init:890 | LLM init start
tokenizer_type = 1
 96% | ##############################   |  30 /  31 [3.84s<3.97s, 7.81 count/s] init post axmodel ok,remain_cmm(7523 MB)
14:56:50.510 INF Init:1045 | max_token_len : 2048
14:56:50.510 INF Init:1048 | kv_cache_size : 1024, kv_cache_num: 2048
14:56:50.510 INF init_groups_from_model:606 | prefill_token_num : 128
14:56:50.510 INF init_groups_from_model:820 | decode grp: 0, gid: 0, max_token_len : 2048
14:56:50.510 INF init_groups_from_model:824 | prefill grp: 0, gid: 1, history_cap: 0, total_cap: 128, symbolic_cap: 1
14:56:50.510 INF init_groups_from_model:824 | prefill grp: 1, gid: 2, history_cap: 128, total_cap: 256, symbolic_cap: 128
14:56:50.510 INF init_groups_from_model:824 | prefill grp: 2, gid: 3, history_cap: 256, total_cap: 384, symbolic_cap: 256
14:56:50.510 INF init_groups_from_model:824 | prefill grp: 3, gid: 4, history_cap: 384, total_cap: 512, symbolic_cap: 384
14:56:50.510 INF init_groups_from_model:824 | prefill grp: 4, gid: 5, history_cap: 512, total_cap: 640, symbolic_cap: 512
14:56:50.510 INF init_groups_from_model:824 | prefill grp: 5, gid: 6, history_cap: 640, total_cap: 768, symbolic_cap: 640
14:56:50.510 INF init_groups_from_model:824 | prefill grp: 6, gid: 7, history_cap: 768, total_cap: 896, symbolic_cap: 768
14:56:50.510 INF init_groups_from_model:824 | prefill grp: 7, gid: 8, history_cap: 896, total_cap: 1024, symbolic_cap: 896
14:56:50.510 INF init_groups_from_model:824 | prefill grp: 8, gid: 9, history_cap: 1024, total_cap: 1152, symbolic_cap: 1024
14:56:50.510 INF init_groups_from_model:831 | prefill_max_token_num : 1152
14:56:50.510 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  31 /  31 [3.84s<3.84s, 8.07 count/s] embed_selector init ok
14:56:50.511 INF load_config:282 | load config:
14:56:50.511 INF load_config:282 | {
14:56:50.511 INF load_config:282 |     "enable_repetition_penalty": false,
14:56:50.511 INF load_config:282 |     "enable_temperature": false,
14:56:50.511 INF load_config:282 |     "enable_top_k_sampling": false,
14:56:50.511 INF load_config:282 |     "enable_top_p_sampling": false,
14:56:50.511 INF load_config:282 |     "penalty_window": 20,
14:56:50.511 INF load_config:282 |     "repetition_penalty": 1.2,
14:56:50.511 INF load_config:282 |     "temperature": 0.9,
14:56:50.511 INF load_config:282 |     "top_k": 10,
14:56:50.511 INF load_config:282 |     "top_p": 0.8
14:56:50.511 INF load_config:282 | }
14:56:50.511 INF Init:1139 | LLM init ok
Commands:
  /q, /exit  退出
  /reset     重置 kvcache
  /dd        删除一轮对话
  /pp        打印历史对话
Ctrl+C: 停止当前生成
----------------------------------------
prompt >> who are you
14:56:53.365 INF SetKVCache:1437 | decode_grpid:0 prefill_grpid:1 history_cap:0 total_cap:128 symbolic_cap:1 precompute_len:0 input_num_token:22 prefer_symbolic_group:0
14:56:53.365 INF SetKVCache:1458 | current prefill_max_token_num:1152
14:56:53.462 INF SetKVCache:1462 | first run
14:56:53.464 INF Run:1553 | input token num : 22, prefill_split_num : 1
14:56:53.464 INF Run:1640 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=22
14:56:53.465 INF Run:1665 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
14:56:53.656 INF Run:1837 | ttft: 191.80 ms
<think>
Okay, the user asked, "Who are you?" I need to respond appropriately. Let me think.

First, I should introduce myself clearly. My name is Assistant, but I should mention that I'm an AI developed by Alibaba Group. That's important to set the context.

Next, I should explain my purpose. I'm here to help with questions, provide information, and assist with tasks. It's important to highlight that I can't perform actions like accessing external data or making decisions, but I can offer help based on the information I have.

I should also mention that I'm designed to be helpful and friendly, but I can't replace real people. It's good to remind the user that if they have specific questions, they should ask directly.

I need to keep the tone friendly and approachable. Avoid technical jargon. Make sure the response is concise but covers all the key points. Let me check if I missed anything. Oh, maybe mention that I can help with various topics, but I can't provide personal advice. That's a good point to include.

Alright, putting it all together in a natural, conversational way.
</think>

Hello! I'm Assistant, an AI developed by Alibaba Group. I'm here to help with questions, provide information, and assist with tasks. I can't perform actions like accessing external data or making decisions, but I can offer help based on the information I have. I'm designed to be helpful and friendly, but I can't replace real people. If you have specific questions or need assistance, feel free to ask! 😊

14:57:31.019 NTC Run:2102 | hit eos,decode avg 8.59 token/s
14:57:31.019 INF GetKVCache:1408 | precompute_len:344, remaining:808
prompt >> /q
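
The numbers in the log above can be cross-checked with a little arithmetic. The sketch below uses hypothetical helper names and assumes that `precompute_len` counts prompt tokens plus generated tokens, and that `remaining` is measured against `prefill_max_token_num`; both readings fit this log but are not confirmed by the axllm documentation.

```python
def generated_tokens(precompute_len, input_tokens):
    """Tokens produced by the model, assuming the KV cache holds prompt + output."""
    return precompute_len - input_tokens


def remaining_prefill_budget(prefill_max_token_num, precompute_len):
    """Tokens still available, assuming the budget is prefill_max_token_num."""
    return prefill_max_token_num - precompute_len


def estimated_response_seconds(ttft_ms, output_tokens, decode_tokens_per_s):
    """Rough wall-clock estimate: time to first token plus steady-state decoding."""
    return ttft_ms / 1000.0 + output_tokens / decode_tokens_per_s


# Values copied from the log above.
out_tokens = generated_tokens(precompute_len=344, input_tokens=22)
remaining = remaining_prefill_budget(prefill_max_token_num=1152, precompute_len=344)
seconds = estimated_response_seconds(ttft_ms=191.80, output_tokens=out_tokens,
                                     decode_tokens_per_s=8.59)

print("generated tokens:", out_tokens)
print("remaining budget:", remaining)
print(f"estimated response time: {seconds:.1f} s")
```

The remaining budget comes out to 808, matching the `remaining:808` value in the log, and the estimated response time is close to the interval between the `ttft` and `hit eos` timestamps.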

Start the server (OpenAI-compatible)

(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-1.7B/
15:00:30.500 INF Init:890 | LLM init start
tokenizer_type = 1
 96% | ##############################   |  30 /  31 [3.19s<3.29s, 9.42 count/s] init post axmodel ok,remain_cmm(7523 MB)
15:00:33.685 INF Init:1045 | max_token_len : 2048
15:00:33.685 INF Init:1048 | kv_cache_size : 1024, kv_cache_num: 2048
15:00:33.685 INF init_groups_from_model:606 | prefill_token_num : 128
15:00:33.685 INF init_groups_from_model:820 | decode grp: 0, gid: 0, max_token_len : 2048
15:00:33.685 INF init_groups_from_model:824 | prefill grp: 0, gid: 1, history_cap: 0, total_cap: 128, symbolic_cap: 1
15:00:33.685 INF init_groups_from_model:824 | prefill grp: 1, gid: 2, history_cap: 128, total_cap: 256, symbolic_cap: 128
15:00:33.685 INF init_groups_from_model:824 | prefill grp: 2, gid: 3, history_cap: 256, total_cap: 384, symbolic_cap: 256
15:00:33.685 INF init_groups_from_model:824 | prefill grp: 3, gid: 4, history_cap: 384, total_cap: 512, symbolic_cap: 384
15:00:33.685 INF init_groups_from_model:824 | prefill grp: 4, gid: 5, history_cap: 512, total_cap: 640, symbolic_cap: 512
15:00:33.685 INF init_groups_from_model:824 | prefill grp: 5, gid: 6, history_cap: 640, total_cap: 768, symbolic_cap: 640
15:00:33.685 INF init_groups_from_model:824 | prefill grp: 6, gid: 7, history_cap: 768, total_cap: 896, symbolic_cap: 768
15:00:33.685 INF init_groups_from_model:824 | prefill grp: 7, gid: 8, history_cap: 896, total_cap: 1024, symbolic_cap: 896
15:00:33.685 INF init_groups_from_model:824 | prefill grp: 8, gid: 9, history_cap: 1024, total_cap: 1152, symbolic_cap: 1024
15:00:33.686 INF init_groups_from_model:831 | prefill_max_token_num : 1152
15:00:33.686 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  31 /  31 [3.19s<3.19s, 9.73 count/s] embed_selector init ok
15:00:33.686 INF load_config:282 | load config:
15:00:33.686 INF load_config:282 | {
15:00:33.686 INF load_config:282 |     "enable_repetition_penalty": false,
15:00:33.686 INF load_config:282 |     "enable_temperature": false,
15:00:33.686 INF load_config:282 |     "enable_top_k_sampling": false,
15:00:33.686 INF load_config:282 |     "enable_top_p_sampling": false,
15:00:33.686 INF load_config:282 |     "penalty_window": 20,
15:00:33.686 INF load_config:282 |     "repetition_penalty": 1.2,
15:00:33.686 INF load_config:282 |     "temperature": 0.9,
15:00:33.686 INF load_config:282 |     "top_k": 10,
15:00:33.686 INF load_config:282 |     "top_p": 0.8
15:00:33.686 INF load_config:282 | }
15:00:33.686 INF Init:1139 | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-1.7B'...
API URLs:
  GET  http://127.0.0.1:8000/health
  GET  http://127.0.0.1:8000/v1/models
  POST http://127.0.0.1:8000/v1/chat/completions
  GET  http://10.126.29.54:8000/health
  GET  http://10.126.29.54:8000/v1/models
  POST http://10.126.29.54:8000/v1/chat/completions
  GET  http://172.17.0.1:8000/health
  GET  http://172.17.0.1:8000/v1/models
  POST http://172.17.0.1:8000/v1/chat/completions
Aliases:
  GET  http://127.0.0.1:8000/models
  POST http://127.0.0.1:8000/chat/completions
  GET  http://10.126.29.54:8000/models
  POST http://10.126.29.54:8000/chat/completions
  GET  http://172.17.0.1:8000/models
  POST http://172.17.0.1:8000/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-1.7B
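
Once the server is up, the `GET` endpoints listed in the log can be probed with the Python standard library. This is a minimal sketch: the helper names `endpoint` and `get_status` are hypothetical, and because the response body format is not documented here, the raw text is returned unparsed.

```python
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # address printed by `axllm serve` above


def endpoint(path, base_url=BASE_URL):
    """Join a base URL and a path without doubling or dropping slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_status(path, base_url=BASE_URL, timeout=5.0):
    """Send a GET request and return (HTTP status code, raw body text).

    urllib raises urllib.error.URLError if the server is unreachable and
    urllib.error.HTTPError for non-success status codes.
    """
    with urllib.request.urlopen(endpoint(path, base_url), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

With the server running, `get_status("/health")` and `get_status("/v1/models")` exercise the two `GET` routes shown in the startup log.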

OpenAI API call example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-1.7B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)
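
In the CLI run shown earlier, the model emits its reasoning inside `<think>...</think>` before the final answer. If the server returns that block inline in `message.content` (an assumption; this README does not confirm it), the reasoning can be separated from the answer as sketched below. The helper name `split_thinking` is hypothetical.

```python
import re

# Non-greedy match across newlines for the first <think>...</think> block.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Split model output into (reasoning, answer).

    Returns an empty reasoning string when no <think> block is present.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

For example, `reasoning, answer = split_thinking(completion.choices[0].message.content)` leaves only the user-facing reply in `answer`.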

OpenAI streaming call example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-1.7B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    # Some chunks may carry no choices or no text; skip them.
    if not ev.choices:
        continue
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("")
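
The same streaming loop can be used to get a rough client-side view of decode speed. The sketch below is illustrative only: `measure_stream` is a hypothetical helper, and treating one streamed chunk as roughly one token is an assumption about the server rather than something this README states.

```python
import time


def measure_stream(stream, clock=time.perf_counter):
    """Consume a chat-completions stream and return (text, chunks, chunks_per_second).

    The rate is measured from the first content chunk to the last one, so it
    excludes the time to first token.
    """
    parts = []
    first = last = None
    for event in stream:
        if not event.choices:
            continue
        content = getattr(event.choices[0].delta, "content", None)
        if not content:
            continue
        now = clock()
        if first is None:
            first = now
        last = now
        parts.append(content)
    elapsed = (last - first) if parts else 0.0
    # N chunks span N-1 intervals; guard against dividing by zero for short replies.
    rate = (len(parts) - 1) / elapsed if elapsed > 0 else 0.0
    return "".join(parts), len(parts), rate
```

Passing the `stream` object from the example above, as in `text, chunks, rate = measure_stream(stream)`, gives a figure that can be compared loosely with the `decode avg` value printed by the CLI.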
