Qwen3.5-0.8B: Apple Core AI (.aimodel)

Qwen3.5-0.8B converted to Apple's Core AI (the Core ML successor announced at WWDC26), ready to run on iOS 27 / macOS 27. It is a hybrid linear-attention model (3 gated-delta, Mamba-style layers per full-attention layer) running through Core AI's runtime, with greedy top-1 output exactly matching the Hugging Face reference.

This repo publishes one bundle per platform × compute unit: the best verified configuration for each, plus the cross-platform gpu-pipelined/ bundle. Each file is the exact artifact behind the published numbers; nothing here is experimental.

Requires the iOS 27 / macOS 27 beta (Core AI ships with the OS). Conversion code, knowledge base, and the Swift runner: coreai-model-zoo.

Pick your platform (measured: iPhone 17 Pro / M4 Max, greedy, top-1 exact vs HF)

| Category | File | Precision | Size | Speed |
| --- | --- | --- | --- | --- |
| GPU pipelined ★★★ (iOS + macOS, NEW ship) | gpu-pipelined/qwen3_5_0_8b_decode_int8hu_perchan_sym/ (full bundle: .aimodel + tokenizer + metadata) | int8 linear per-block-32 + per-block-32 absmax int8 lm_head (untied; the dir name says perchan for historical reasons, see note below) | 1.3 GB | 69.7–74.0 tok/s iPhone 17 Pro · 210 tok/s M4 Max |
| GPU pipelined ★★ (iOS + macOS) | gpu-pipelined/qwen3_5_0_8b_decode_int8lin/ (full bundle: .aimodel + tokenizer + metadata) | int8 linear per-block-32 (no LUT), fp16 tied head, decode-only loop-free, dynamic KV | 1.0 GB | 50.3–51.5 tok/s iPhone 17 Pro · 204 tok/s M4 Max |
| iOS GPU ★ | ios-gpu/qwen3_5_0_8b_ios_hc0_int8v3.aimodel | int8 fused Metal kernels (k-means LUT, fp32 accumulate) + GPU argmax head, static ctx-2048 | 1.3 GB | 42.5–45.4 tok/s decode |
| iOS GPU ★ companion | ios-gpu/qwen3_5_0_8b_ios_hc_prefill_q16_b2048_int8.aimodel | chunked-prefill graph (q=16 blocks, int8 LUT) | 1.0 GB | 147 tok/s prefill (185-tok prompt: 4.2 s → 1.26 s) |
| iOS GPU (previous) | ios-gpu/qwen3_5_0_8b_ios_hc0.aimodel | fp16, static ctx-2048 | 1.4 GB | 27.7 tok/s |
| iOS ANE | ios-ane/qwen3_5_0_8b_decode_int8.aimodel | int8 k-means (fp16 embed), dynamic | 969 MB | 14.7 tok/s |
| macOS GPU | macos/qwen3_5_0_8b_decode_int8.aimodel | same bundle as iOS ANE | 969 MB | 58.5 tok/s (release build) |
  • The ★★★ ship bundle adds an untied lm_head quantized as per-block-32 absmax int8 (int8hu --head-sym). The fp16 head was 54% of the per-token weight read on the bandwidth-bound phone, so quantizing it is worth +40% on iPhone (and +3% on M4 Max). Quantize big-vocab heads with plain absmax symmetric; the default symmetric_with_clipping clips outlier head rows and corrupts top-1s. Greedy rollouts are token-identical to the ★★ bundle; same run contract. Naming note (2026-06-11): the directory is named _perchan_sym, but its head is per-block-32, because the export script of the day parsed the granularity flag without applying it (since fixed). The numbers above were measured on exactly these bytes and stand. Genuinely per-channel (axis-0) int8 weights turned out to be broken on the current beta GPU delegate (garbage logits; a delegate lowering bug, minimal repro in the zoo), so per-block-32 + symmetric IS the correct ship shape, not a stand-in. The dir name is kept to avoid breaking download paths.
  • The ★★ pipelined bundle, the base of the ★★★ ship bundle, beats every non-pipelined config on BOTH platforms with zero custom kernels: a decode-only loop-free graph (static [1,1] query, dynamic KV) that rides Apple's coreai-pipelined engine (CoreAILanguageModels / EngineFactory: async non-blocking encode, on-GPU argmax sampling, on-device KV growth) instead of a per-token run loop. Its output is token-for-token identical to the fp16-GPU sequence and matches the fp32 HF oracle on 16/16 single-step top-1 checks. It needs two things from the zoo: the engine extra-states patch (the stock engine carries exactly 2 states; the SSM conv/rec states ride as fixed-shape extras) and COREAI_CHUNK_THRESHOLD=1 at run time (prefill = pipelined S=1 steps ≈ decode speed, so for LONG prompts the ★ static pair below still wins time-to-first-token). Export: conversion/export_qwen3_5_decode_pipelined.py.
  • The ★ int8 fused-kernel monolith is the custom-kernel static config (~3× dynamic, ~1.6× the fp16 static path; the current app-release config). The device GPU is weight-bandwidth-bound, so fused dequant-in-matvec Metal kernels (embedded in the .aimodel; 100% Core AI, WWDC26 session 325) halve the per-token weight stream; the 248320-token tied head runs as a fused matvec + two-level GPU argmax (greedy). Pair it with the prefill companion: the prompt is consumed 16 tokens per pass (in-graph unrolled SSM scan, fp32 recurrence; full blocks only, remainder + generation on the decode graph). Decode output is byte-identical with and without it.
  • The static monoliths are GPU-only: the fp32-SSM form does not produce correct output on the ANE on current betas (and custom Metal kernels are GPU-only). Use the ANE bundle there.
  • The dynamic int8 bundle (one graph, prefill+decode, 4 states: keyCache/valueCache/convState/recState) is the proven Neural-Engine path, and the same file is the best non-pipelined macOS config (the ios-ane/ and macos/ files are identical content; pick by folder for clarity).
  • The SSM while_loop doesn't lower on device delegates, so these bundles use the loop-free single-step decode (bit-identical at query_len=1; the prefill graph unrolls the same scan 16× with the state held in fp32). Story + gotchas: knowledge base.
  • int8 in the static/dynamic bundles is k-means palettization, gated 8/8 vs the HF oracle in PyTorch before export; the pipelined bundle uses linear per-block int8 (scale-multiply dequant, because 256-entry k-means LUTs are slow on the GPU delegate: 204 vs 113 tok/s on M4 Max). int4 does not survive on this model (head/MLP/SSM all degrade; k-means g8 with int8 rescue layers also fails the oracle gate; unlike Gemma 4).
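The per-block-32 absmax symmetric scheme described in the notes above can be illustrated with a minimal pure-Python sketch. This is not the zoo's export code: the function names are hypothetical, and the real exporter's rounding and scale storage may differ. It only shows the idea that each block of 32 weights gets its own scale set by its largest magnitude, so nothing is clipped and an outlier coarsens only its own block.

```python
def quantize_block_absmax(block):
    """Symmetric absmax int8 for one block: scale = max|w| / 127, no clipping."""
    absmax = max(abs(w) for w in block)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # The clamp only guards against floating-point overshoot at the extremes.
    q = [max(-127, min(127, round(w / scale))) for w in block]
    return q, scale


def quantize_per_block(weights, block_size=32):
    """Split a weight row into blocks and quantize each one independently."""
    return [
        quantize_block_absmax(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks):
    """Scale-multiply dequant: w is approximately q * scale."""
    return [q * scale for qs, scale in blocks for q in qs]


# Toy row of 64 small weights with one outlier in the first block.
row = [0.01 * ((i % 7) - 3) for i in range(64)]
row[5] = 4.0
blocks = quantize_per_block(row)
restored = dequantize(blocks)

# The outlier survives intact, and only block 0 pays for it with a coarse scale.
print(restored[5], blocks[0][1], blocks[1][1])
```

A clipping variant would instead pick a scale smaller than absmax / 127 and saturate the outlier, which is the failure mode the note warns about for big-vocab heads.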

Run it (pipelined, Swift, macOS 27)

git clone https://github.com/apple/coreai-models && cd coreai-models
git apply <(curl -sL https://github.com/john-rocky/coreai-model-zoo/raw/main/apps/coreai-pipelined-extra-states.patch)
COREAI_CHUNK_THRESHOLD=1 swift run -c release llm-benchmark \
  --model <path-to>/gpu-pipelined/qwen3_5_0_8b_decode_int8lin -p 128 -g 256 -n 3

In an app, load the bundle via LanguageBundle + EngineFactory.createEngine. Set COREAI_CHUNK_THRESHOLD=1 before engine creation, and never call warmup(): it warms shape 256, which the S=1 graph rejects. A 1-token generate is the warmup.
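Because the pipelined engine prefills with S=1 steps, long prompts are where the ★ static pair pulls ahead on time-to-first-token. The rough arithmetic below uses only the iPhone figures from the table and treats throughput as constant. The two step-wise rates are assumptions (prefill at about decode speed), and the helper names are hypothetical.

```python
def split_prompt(prompt_len, block=16):
    """Full 16-token blocks go to the prefill graph; the rest goes to the decode graph."""
    return prompt_len // block, prompt_len % block


def prefill_seconds(prompt_len, tok_per_s):
    """Time to consume the prompt at a constant throughput."""
    return prompt_len / tok_per_s


PROMPT = 185  # the prompt length quoted in the table

full_blocks, remainder = split_prompt(PROMPT)

# Prefill companion: 147 tok/s effective prefill (table figure).
chunked = prefill_seconds(PROMPT, 147.0)

# Assumption: token-by-token prefill on the static decode graph at ~44 tok/s
# (inside the table's 42.5-45.4 tok/s decode range).
static_stepwise = prefill_seconds(PROMPT, 44.0)

# Assumption: pipelined S=1 prefill at the low end of the 69.7-74.0 tok/s range.
pipelined_stepwise = prefill_seconds(PROMPT, 69.7)

print(full_blocks, remainder)          # 11 9
print(round(chunked, 2))               # 1.26
print(round(static_stepwise, 1))       # 4.2
print(round(pipelined_stepwise, 2))
```

The first two timings reproduce the table's "4.2 s → 1.26 s", and under these assumptions the pipelined step-wise prefill lands between them.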

Run it (Python, macOS 27)

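from pathlib import Path

import coreai.runtime as rt

# Uses top-level await: run inside an async context (e.g. the `python -m asyncio` REPL).
# ids, pos, and state (the model's cache state) are prepared by the caller.
model = await rt.AIModel.load(Path("qwen3_5_0_8b_decode_int8.aimodel"),
        rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu()))
fn = model.load_function("main")
out = await fn({"input_ids": rt.NDArray(ids), "position_ids": rt.NDArray(pos)}, state=state)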

On device, push the bundle into your app sandbox (xcrun devicectl device copy to --domain-type appDataContainer); see the Swift runtime notes. Tokenizer: use the original Qwen/Qwen3.5-0.8B tokenizer (swift-transformers loads it directly).

Parity

Greedy decode matches the HF eager reference 8/8 tokens, top-1 exact (prompt-level cosine 0.9999+), verified on macOS conversion and re-verified end-to-end on the iPhone per compute unit. The int8-kernel monolith additionally passes the chained Mac-GPU greedy 8/8 vs the oracle, and the prefill companion passes both an oracle gate and a chunked-vs-q=1 parity gate (identical tokens). ⚠️ Known beta issue affecting all Core AI LLMs (and how these bundles dodge it): the KV-write bug page.
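A single-step parity gate of the kind described above can be sketched as follows. This is an illustration, not the zoo's verification script: it assumes each step yields a plain list of logits from the reference and from the converted model, and that cosine similarity is taken over those logits. The function names and toy values are hypothetical.

```python
import math


def argmax(logits):
    """Index of the largest logit, i.e. the greedy top-1 token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def parity_gate(ref_steps, test_steps):
    """Return (top-1 matches, total steps, worst per-step cosine)."""
    matches = sum(argmax(r) == argmax(t) for r, t in zip(ref_steps, test_steps))
    worst = min(cosine(r, t) for r, t in zip(ref_steps, test_steps))
    return matches, len(ref_steps), worst


# Toy logits for two decode steps over a 3-token vocabulary.
ref = [[0.1, 2.0, -1.0], [3.0, 0.2, 0.1]]
test = [[0.11, 1.98, -1.02], [2.97, 0.21, 0.1]]

matches, total, worst = parity_gate(ref, test)
print(f"top-1 {matches}/{total}, worst cosine {worst:.6f}")
```

A gate like this passes only when matches equals total. The cosine value is a secondary signal, since small logit drift can leave top-1 unchanged.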

CoreML (iOS 18+) variant of this model: qwen3.5-0.8B-CoreML.
