qwen3.5-4B-super-coder (Q_4.0 GGUF)

qwen3.5-4B-super-coder is a 4-bit quantized GGUF model optimized for fast, reliable coding, structured tool calling, and active reasoning (thinking mode) on consumer/mobile hardware. It is distilled from Claude Sonnet 4.6 & Opus 4.6, and merged/quantized using Unsloth.

Model Summary & Architecture

  • Base Model: Qwen/Qwen3.5-4B
  • Format: GGUF (Q_4.0 Quantization)
  • Size: ~2.6 GB
  • Context Window: 32K (optimized for mobile RAM budgets, natively supports up to 262K/1M context via YaRN)
  • Key Architectural Advantage: The base Qwen3.5-4B model uses a hybrid architecture combining Gated DeltaNet (3 layers) and Full Attention (1 layer) repeating. Since only 8 of the 32 layers store a full KV cache, the KV cache footprint is incredibly small (~0.4GB for 32K context), making it exceptionally well-suited for high-context coding on mobile devices (e.g., iPhone 15 Pro+, flagship Android, iPad Pro).

Distillation & Training Procedure

This model was trained using a staged Supervised Fine-Tuning (SFT) pipeline to systematically inject reasoning capability, coding specialization, and tool-calling precision:

                  ┌──────────────────────────────────────────┐
                  │                 Phase A:                 │
                  │   General Distillation (Claude Style)   │
                  │   Dataset: Claude-Distills (140K)        │
                  └────────────────────┬─────────────────────┘
                                       │
                                       ▼
                  ┌──────────────────────────────────────────┐
                  │                 Phase B:                 │
                  │   Specialization (Coding & Tool Calling) │
                  │   Dataset: Curated Replay Mix (77K)      │
                  └────────────────────┬─────────────────────┘
                                       │
                                       ▼
                  ┌──────────────────────────────────────────┐
                  │                 Phase C:                 │
                  │   Tool Precision & Schema Conformance    │
                  │   Dataset: Tool-focused Mix (~20K)       │
                  └──────────────────────────────────────────┘
  1. Phase 1: Distillation (Claude Behavior)
    • Dataset: clzoro/Claude-Distills (140K samples; Sonnet 4.6 + Opus 4.6).
    • Objective: Transfer general instruction-following, Claude-like formatting/tone, and reasoning capabilities. The Opus subset (21K samples) provided the crucial <think> block traces to establish thinking capabilities.
  2. Phase 2: Specialization (Coding & Tools)
    • Dataset: Curated 77K sample mix (55K coding instructions, 13K tool calling, and 9K general anti-forgetting replay samples).
    • Objective: Specialize the model on coding accuracy across Python, JS, Shell, etc., and introduce structured tool-calling.
  3. Phase 3: Tool Precision
    • Dataset: Focused tool-calling dataset (~20K samples) with schema variations, neg/no-tool examples, and strict JSON format targets.
    • Objective: Ensure precise JSON schema conformance and reduce tool false-positives.
  4. Phase 4: Coding/Tool Specialization Continuation
    • Starting point: jica98/qwen3.5-4b-claude-distill-lora Phase 3 LoRA.
    • Output adapter: qwen3.5-4b-phase4-specialize-lora.
    • Training mix: local filtered coding/tool data from filtered_dataset/train.jsonl, Claude distillation replay from data/claude_distill.jsonl, and an Opus replay slice to retain visible reasoning behavior.
    • Objective: Continue the distilled LoRA into a stronger coding/tool-specialized adapter while preserving anti-forgetting replay.
    • Default recipe: 1024 max sequence length, batch size 1, gradient accumulation 8, learning rate 1e-4, 1 epoch, checkpointing every 200 steps.

Phase 5 Fable Reasoning Fine-Tune

The latest adapter was further fine-tuned for Fable reasoning and agentic coding traces after the Phase 4 specialization pass.

Phase 5 training data:

  • kelexine/fable-5-sft-traces for cleaned Fable reasoning/SFT traces.
  • armand0e/claude-fable-5-claude-code for raw Claude/Fable-5 agent traces.
  • victor/fable-5-boeing-747-trace for the Boeing 747 Claude Code/Fable-5 trace.

Training summary:

  • Starting point: qwen3.5-4b-phase4-specialize-lora.
  • Output adapter: qwen3.5-4b-phase5-fable-lora.
  • After dedupe/sample in the recorded run: 4,721 examples.
  • After max-length filtering at 4096 tokens: 4,267 examples.
  • Default recipe: batch size 1, gradient accumulation 8, learning rate 5e-5, 1 epoch, BF16, adamw_8bit.

The Phase 5 data loader normalizes traces into Qwen chat-template text, groups raw Claude event logs into session conversations, deduplicates samples, filters by token length, and skips checkpoint artifacts during Hub upload by default.

Strengths & What It Is Good At

  • 💻 Conversational Programming: Excel at writing clean, efficient, and well-commented code in Python, C++, Rust, JavaScript, Shell, and more.
  • 🧠 Visible Reasoning (Thinking Mode): When faced with complex reasoning or coding tasks, the model engages a <think>...</think> block to outline its plan before writing code.
  • 🛠️ Reliable Tool Calling: Specially tuned to parse and output valid JSON tool parameters conforming to provided function schemas.
  • 📱 Mobile & Edge Execution: With a weight footprint of ~2.6GB and extremely low KV cache overhead, it fits comfortably on 8GB+ RAM edge devices.

Recommended Inference Settings

For the best balance of reasoning depth and formatting precision, use the following generation parameters:

  • Temperature: 0.6
  • Top-P: 0.95
  • Top-K: 20
  • Min-P: 0.0
  • Flash Attention: Enable -fa in llama.cpp/llama-cli for optimal speeds.
  • System Prompt: Set system prompt to guide the assistant (e.g. You are a helpful coding assistant.).

Benchmark Results (Q4_0 GGUF via LM Studio)

Benchmark run against GGUF Q4_0 quant served through LM Studio on consumer AMD ROCm hardware. Results file: benchmark/lmstudio_q4_benchmark/benchmark_report.md

Benchmark Score Status
HumanEval+ Pass@1 0.00 ok
MBPP+ Pass@1 0.00 ok
BigCodeBench-Hard needs_review
LiveCodeBench v6 not_run
BFCL v4 needs_review
IFEval needs_review
MMLU-Pro needs_review
JSON validity 40.00% ok
No-tool accuracy 87.50% ok

Notes:

  • Several benchmarks require environment setup that wasn't completed (IFEval, MMLU-Pro, BFCL, BigCodeBench-Hard).
  • HumanEval+ and MBPP+ scored 0.00 — the Q4_0 quant may degrade code generation significantly; evaluation with the BF16 base is needed for comparison.
  • JSON validity and No-tool accuracy are custom deterministic diagnostics.
Downloads last month
14,512
GGUF
Model size
4B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for jica98/qwen3.5-4B-super-coder

Finetuned
Qwen/Qwen3.5-4B
Quantized
(281)
this model