How to use jica98/qwen3.5-4B-super-coder with libraries and local apps.

Libraries

How to use jica98/qwen3.5-4B-super-coder with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

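llm = Llama.from_pretrained(
    repo_id="jica98/qwen3.5-4B-super-coder",
    # NOTE: files named "mmproj" usually hold a multimodal projector, not the
    # language-model weights. If loading fails, choose the main model GGUF
    # from the repository's file list instead.
    filename="qwen3.5-4B-super-coder.BF16-mmproj.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)

# Print only the assistant's reply text.
print(response["choices"][0]["message"]["content"])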

Local Apps

llama.cpp

How to use jica98/qwen3.5-4B-super-coder with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf jica98/qwen3.5-4B-super-coder:BF16
# Run inference directly in the terminal:
llama cli -hf jica98/qwen3.5-4B-super-coder:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf jica98/qwen3.5-4B-super-coder:BF16
# Run inference directly in the terminal:
llama cli -hf jica98/qwen3.5-4B-super-coder:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf jica98/qwen3.5-4B-super-coder:BF16
# Run inference directly in the terminal:
./llama-cli -hf jica98/qwen3.5-4B-super-coder:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf jica98/qwen3.5-4B-super-coder:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf jica98/qwen3.5-4B-super-coder:BF16

Use Docker

docker model run hf.co/jica98/qwen3.5-4B-super-coder:BF16

vLLM

How to use jica98/qwen3.5-4B-super-coder with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "jica98/qwen3.5-4B-super-coder"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jica98/qwen3.5-4B-super-coder",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
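The same request can also be sent from Python using only the standard library. This is a minimal sketch, assuming the vLLM server started above is listening on http://localhost:8000; the helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("http://localhost:8000", "jica98/qwen3.5-4B-super-coder", "What is the capital of France?")` should return the reply text. Pointing `base_url` at http://localhost:8080 should work the same way against the llama.cpp server, since both expose the same OpenAI-compatible route.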

Use Docker

docker model run hf.co/jica98/qwen3.5-4B-super-coder:BF16

Ollama
How to use jica98/qwen3.5-4B-super-coder with Ollama:
```
ollama run hf.co/jica98/qwen3.5-4B-super-coder:BF16
```

Unsloth Studio

How to use jica98/qwen3.5-4B-super-coder with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for jica98/qwen3.5-4B-super-coder to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for jica98/qwen3.5-4B-super-coder to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for jica98/qwen3.5-4B-super-coder to start chatting

Pi

How to use jica98/qwen3.5-4B-super-coder with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf jica98/qwen3.5-4B-super-coder:BF16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "jica98/qwen3.5-4B-super-coder:BF16"
        }
      ]
    }
  }
}
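If ~/.pi/agent/models.json already lists other providers, pasting the snippet over the file would drop them. The sketch below merges the entry instead. It assumes only the JSON layout shown above; `merge_provider` and `update_models_file` are hypothetical helper names, not part of Pi.

```python
import json
from pathlib import Path

# Provider entry copied from the configuration shown above.
LLAMA_CPP_PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "jica98/qwen3.5-4B-super-coder:BF16"}],
}


def merge_provider(config: dict, name: str, provider: dict) -> dict:
    """Return a copy of `config` with `provider` stored under providers[name]."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    providers[name] = provider
    merged["providers"] = providers
    return merged


def update_models_file(path: Path) -> None:
    """Merge the llama-cpp provider into the JSON file at `path`, creating it if needed."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_provider(existing, "llama-cpp", LLAMA_CPP_PROVIDER)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2) + "\n")
```

Calling `update_models_file(Path.home() / ".pi" / "agent" / "models.json")` would then add or replace only the `llama-cpp` entry and leave any other providers untouched.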

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use jica98/qwen3.5-4B-super-coder with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf jica98/qwen3.5-4B-super-coder:BF16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default jica98/qwen3.5-4B-super-coder:BF16

Run Hermes

hermes

Docker Model Runner
How to use jica98/qwen3.5-4B-super-coder with Docker Model Runner:
```
docker model run hf.co/jica98/qwen3.5-4B-super-coder:BF16
```

Lemonade

How to use jica98/qwen3.5-4B-super-coder with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull jica98/qwen3.5-4B-super-coder:BF16

Run and chat with the model

lemonade run user.qwen3.5-4B-super-coder-BF16

List all available models

lemonade list

qwen3.5-4B-super-coder (Q4_0 GGUF)

qwen3.5-4B-super-coder is a 4-bit quantized GGUF model optimized for fast, reliable coding, structured tool calling, and active reasoning (thinking mode) on consumer/mobile hardware. It is distilled from Claude Sonnet 4.6 & Opus 4.6, and merged/quantized using Unsloth.

Model Summary & Architecture

- Base Model: Qwen/Qwen3.5-4B
- Format: GGUF (Q4_0 Quantization)
- Size: ~2.6 GB
- Context Window: 32K (optimized for mobile RAM budgets, natively supports up to 262K/1M context via YaRN)
- Key Architectural Advantage: The base Qwen3.5-4B model uses a hybrid architecture that repeats a block of three Gated DeltaNet layers and one full-attention layer. Only 8 of the 32 layers therefore store a full KV cache, which keeps the KV cache footprint small (~0.4GB for 32K context) and makes the model well suited to high-context coding on mobile devices (e.g., iPhone 15 Pro+, flagship Android, iPad Pro).
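The layer count behind that claim can be checked with a short calculation. This is a rough sketch, not the model's actual memory accounting: the 3:1 layer pattern and the 32-layer total come from the description above, while `N_KV_HEADS`, `HEAD_DIM`, and the 2-byte cache precision are placeholder assumptions that were not read from the model config. The fixed-size state kept by the Gated DeltaNet layers is not counted.

```python
def layer_types(n_layers: int, linear_per_block: int = 3, full_per_block: int = 1) -> list[str]:
    """Repeat a block of Gated DeltaNet and full-attention layers up to n_layers."""
    block = ["gated_deltanet"] * linear_per_block + ["full_attention"] * full_per_block
    return [block[i % len(block)] for i in range(n_layers)]


def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Size of the K and V tensors across all layers that keep a full KV cache."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * context_len * bytes_per_value


layers = layer_types(32)
n_full = layers.count("full_attention")  # 8 of 32 layers, as stated above

# ASSUMPTION: placeholder head configuration, not taken from the model config.
N_KV_HEADS, HEAD_DIM, CONTEXT = 2, 256, 32 * 1024

hybrid = kv_cache_bytes(n_full, N_KV_HEADS, HEAD_DIM, CONTEXT)
dense = kv_cache_bytes(len(layers), N_KV_HEADS, HEAD_DIM, CONTEXT)

print(f"full-attention layers: {n_full}/{len(layers)}")
print(f"hybrid KV cache: {hybrid / 2**30:.2f} GiB")
print(f"all-attention KV cache: {dense / 2**30:.2f} GiB")
print(f"reduction factor: {dense / hybrid:.0f}x")
```

With these placeholder values the hybrid estimate is 0.50 GiB, the same order of magnitude as the ~0.4GB quoted above; the exact figure depends on the real head configuration and cache precision. The 4x reduction relative to a 32-layer all-attention model follows from the layer ratio alone.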

Distillation & Training Procedure

This model was trained using a staged Supervised Fine-Tuning (SFT) pipeline to systematically inject reasoning capability, coding specialization, and tool-calling precision:

                  ┌──────────────────────────────────────────┐
                  │                 Phase 1:                 │
                  │   General Distillation (Claude Style)    │
                  │   Dataset: Claude-Distills (140K)        │
                  └────────────────────┬─────────────────────┘
                                       │
                                       ▼
                  ┌──────────────────────────────────────────┐
                  │                 Phase 2:                 │
                  │   Specialization (Coding & Tool Calling) │
                  │   Dataset: Curated Replay Mix (77K)      │
                  └────────────────────┬─────────────────────┘
                                       │
                                       ▼
                  ┌──────────────────────────────────────────┐
                  │                 Phase 3:                 │
                  │   Tool Precision & Schema Conformance    │
                  │   Dataset: Tool-focused Mix (~20K)       │
                  └──────────────────────────────────────────┘

Phase 1: Distillation (Claude Behavior)

- Dataset: clzoro/Claude-Distills (140K samples; Sonnet 4.6 + Opus 4.6).
- Objective: Transfer general instruction-following, Claude-like formatting/tone, and reasoning capabilities. The Opus subset (21K samples) supplied the <think> block traces that establish the thinking behavior.

Phase 2: Specialization (Coding & Tools)

- Dataset: Curated 77K sample mix (55K coding instructions, 13K tool calling, and 9K general anti-forgetting replay samples).
- Objective: Specialize the model on coding accuracy across Python, JS, Shell, etc., and introduce structured tool calling.

Phase 3: Tool Precision

- Dataset: Focused tool-calling dataset (~20K samples) with schema variations, negative (no-tool) examples, and strict JSON format targets.
- Objective: Ensure precise JSON schema conformance and reduce tool false positives.

Phase 4: Coding/Tool Specialization Continuation

- Starting point: jica98/qwen3.5-4b-claude-distill-lora Phase 3 LoRA.
- Output adapter: qwen3.5-4b-phase4-specialize-lora.
- Training mix: local filtered coding/tool data from filtered_dataset/train.jsonl, Claude distillation replay from data/claude_distill.jsonl, and an Opus replay slice to retain visible reasoning behavior.
- Objective: Continue the distilled LoRA into a stronger coding/tool-specialized adapter while preserving anti-forgetting replay.
- Default recipe: 1024 max sequence length, batch size 1, gradient accumulation 8, learning rate 1e-4, 1 epoch, checkpointing every 200 steps.
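To make the Phase 3 objective concrete, the sketch below shows one way "schema conformance" for a tool call could be checked. It is an illustration only, not the training or evaluation code: `WEATHER_TOOL` is a made-up tool, and the `{"name": ..., "arguments": {...}}` call shape is an assumption about the output format.

```python
import json

# ASSUMPTION: hypothetical tool definition in a JSON-Schema-like style.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer"},
        },
        "required": ["city"],
    },
}

JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def matches_type(value, type_name: str) -> bool:
    """bool is a subclass of int in Python, so booleans are handled first."""
    if isinstance(value, bool):
        return type_name == "boolean"
    return isinstance(value, JSON_TYPES[type_name])


def check_tool_call(raw: str, tool: dict) -> list[str]:
    """Return a list of conformance problems; an empty list means the call conforms."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(call, dict) or not isinstance(call.get("arguments"), dict):
        return ["expected an object with an 'arguments' object"]
    problems = []
    if call.get("name") != tool["name"]:
        problems.append(f"unexpected tool name: {call.get('name')}")
    schema = tool["parameters"]
    args = call["arguments"]
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unknown argument: {key}")
        elif not matches_type(value, spec["type"]):
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems
```

A call such as `{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}` yields an empty problem list, while a call that omits `city` or passes `"3"` as a string for `days` is reported.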

Phase 5: Fable Reasoning Fine-Tune

The latest adapter was further fine-tuned for Fable reasoning and agentic coding traces after the Phase 4 specialization pass.

Phase 5 training data:

- kelexine/fable-5-sft-traces for cleaned Fable reasoning/SFT traces.
- armand0e/claude-fable-5-claude-code for raw Claude/Fable-5 agent traces.
- victor/fable-5-boeing-747-trace for the Boeing 747 Claude Code/Fable-5 trace.

Training summary:

- Starting point: qwen3.5-4b-phase4-specialize-lora.
- Output adapter: qwen3.5-4b-phase5-fable-lora.
- After dedupe/sample in the recorded run: 4,721 examples.
- After max-length filtering at 4096 tokens: 4,267 examples.
- Default recipe: batch size 1, gradient accumulation 8, learning rate 5e-5, 1 epoch, BF16, adamw_8bit.
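The numbers in this summary imply a fairly short run. The arithmetic below uses only the figures listed above and assumes a single training device; whether the final partial accumulation window counts as a step depends on the trainer, so both the floor and the ceiling are shown.

```python
import math

# Figures from the training summary above.
examples_after_dedupe = 4721
examples_after_length_filter = 4267
batch_size = 1
grad_accum = 8

dropped = examples_after_dedupe - examples_after_length_filter
dropped_fraction = dropped / examples_after_dedupe

# ASSUMPTION: one device, so effective batch = batch size x gradient accumulation.
effective_batch = batch_size * grad_accum
steps_floor = examples_after_length_filter // effective_batch
steps_ceil = math.ceil(examples_after_length_filter / effective_batch)

print(f"dropped by length filter: {dropped} ({dropped_fraction:.1%})")
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps for 1 epoch: {steps_floor} to {steps_ceil}")
```

Under these assumptions the length filter removes 454 examples (9.6%), the effective batch size is 8, and one epoch takes 533 to 534 optimizer steps.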

The Phase 5 data loader normalizes traces into Qwen chat-template text, groups raw Claude event logs into session conversations, deduplicates samples, filters by token length, and skips checkpoint artifacts during Hub upload by default.
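The loader steps described above can be sketched roughly as follows. This is a simplified illustration, not the project's loader: the `session_id`, `role`, and `content` field names are hypothetical, the rendering is a bare ChatML-style approximation of the Qwen chat template, and `count_tokens` is a whitespace stand-in for the real tokenizer.

```python
import hashlib
from collections import defaultdict

MAX_TOKENS = 4096


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (whitespace split); a real run would use the model tokenizer."""
    return len(text.split())


def group_sessions(events: list[dict]) -> list[list[dict]]:
    """Group raw event records into conversations by a hypothetical session_id field."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event["session_id"]].append(
            {"role": event["role"], "content": event["content"]}
        )
    return list(sessions.values())


def render(conversation: list[dict]) -> str:
    """Simplified ChatML-style rendering of one conversation."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in conversation
    )


def build_dataset(events: list[dict], max_tokens: int = MAX_TOKENS) -> list[str]:
    """Normalize, deduplicate by content hash, then drop over-length samples."""
    seen, samples = set(), []
    for conversation in group_sessions(events):
        text = render(conversation)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier sample
        seen.add(digest)
        if count_tokens(text) <= max_tokens:
            samples.append(text)
    return samples
```

Deduplicating before the length filter mirrors the order of the counts reported in the training summary (4,721 after dedupe, then 4,267 after filtering).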

Strengths & What It Is Good At

💻 Conversational Programming: Excels at writing clean, efficient, and well-commented code in Python, C++, Rust, JavaScript, Shell, and more.
🧠 Visible Reasoning (Thinking Mode): When faced with complex reasoning or coding tasks, the model engages a <think>...</think> block to outline its plan before writing code.
🛠️ Reliable Tool Calling: Specially tuned to parse and output valid JSON tool parameters conforming to provided function schemas.
📱 Mobile & Edge Execution: With a weight footprint of ~2.6GB and extremely low KV cache overhead, it fits comfortably on 8GB+ RAM edge devices.
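Because thinking mode puts the plan inside a <think>...</think> block, client code usually wants to separate that reasoning from the final answer. A minimal sketch, assuming the block appears at most once and is well formed; `split_reasoning` is an illustrative helper, not part of any library.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no <think> block is present."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

For example, `split_reasoning("<think>\nplan\n</think>\n\nprint('hi')")` returns `("plan", "print('hi')")`, so the reasoning can be logged or hidden while only the answer is shown to the user.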

Recommended Inference Settings

For the best balance of reasoning depth and formatting precision, use the following generation parameters:

- Temperature: 0.6
- Top-P: 0.95
- Top-K: 20
- Min-P: 0.0
- Flash Attention: Enable -fa in llama.cpp/llama-cli for best speed.
- System Prompt: Set a system prompt to guide the assistant (e.g. "You are a helpful coding assistant.").
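These settings map directly onto llama-cpp-python arguments. The sketch below assumes a GGUF file has already been downloaded; `model_path` is a placeholder supplied by the caller, and `generate` and `build_messages` are illustrative helper names. The 32K context matches the window described in the model summary.

```python
# Recommended sampling parameters from the list above.
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}
SYSTEM_PROMPT = "You are a helpful coding assistant."


def build_messages(user_prompt: str, system_prompt: str = SYSTEM_PROMPT) -> list[dict]:
    """Prepend the system prompt to a single user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def generate(model_path: str, user_prompt: str, n_ctx: int = 32768) -> str:
    """Load a local GGUF file and run one chat turn with the recommended settings."""
    # Imported lazily so the helpers above can be used without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=n_ctx, flash_attn=True)
    response = llm.create_chat_completion(
        messages=build_messages(user_prompt),
        **SAMPLING,
    )
    return response["choices"][0]["message"]["content"]
```

Keeping the sampling values in one dictionary makes it easy to reuse the same settings across calls or to override a single value for an experiment.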

Benchmark Results (Q4_0 GGUF via LM Studio)

Benchmarks were run against the Q4_0 GGUF quant served through LM Studio on consumer AMD ROCm hardware. Results file: benchmark/lmstudio_q4_benchmark/benchmark_report.md

| Benchmark | Score | Status |
| --- | --- | --- |
| HumanEval+ Pass@1 | 0.00 | ok |
| MBPP+ Pass@1 | 0.00 | ok |
| BigCodeBench-Hard | n/a | needs_review |
| LiveCodeBench v6 | n/a | not_run |
| BFCL v4 | n/a | needs_review |
| IFEval | n/a | needs_review |
| MMLU-Pro | n/a | needs_review |
| JSON validity | 40.00% | ok |
| No-tool accuracy | 87.50% | ok |
Notes:

- Several benchmarks (IFEval, MMLU-Pro, BFCL, BigCodeBench-Hard) require environment setup that was not completed, so they have no score yet.
- HumanEval+ and MBPP+ scored 0.00. The Q4_0 quant may degrade code generation significantly; evaluation with the BF16 base is needed for comparison.
- JSON validity and No-tool accuracy are custom deterministic diagnostics.
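The card does not define the two custom diagnostics, so the sketch below is only one plausible reading: JSON validity as the share of outputs that parse as JSON, and no-tool accuracy as the share of no-tool prompts whose output contains no tool call. The `<tool_call>` marker and both function names are assumptions, and the sample strings in the usage note are invented for illustration.

```python
import json


def is_valid_json(text: str) -> bool:
    """True when the whole string parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def json_validity(outputs: list[str]) -> float:
    """Fraction of outputs that parse as JSON."""
    return sum(is_valid_json(o) for o in outputs) / len(outputs)


def no_tool_accuracy(outputs: list[str], tool_marker: str = "<tool_call>") -> float:
    """Fraction of no-tool prompts whose output contains no tool call marker."""
    return sum(tool_marker not in o for o in outputs) / len(outputs)
```

For instance, `json_validity(['{"a": 1}', '[1, 2]', '{"a": ', 'null'])` gives 0.75, and `no_tool_accuracy(["Paris.", '<tool_call>{"name": "x"}</tool_call>'])` gives 0.5. Both checks are deterministic because they depend only on the output text.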

Downloads last month: 14,512
Format: GGUF
Model size: 4B params
Architecture: qwen35
Hardware compatibility: 4-bit

Model tree for jica98/qwen3.5-4B-super-coder

Base model: Qwen/Qwen3.5-4B-Base
Finetuned: Qwen/Qwen3.5-4B
Quantized (281): this model