qwen3.5-4B-super-coder (Q4_0 GGUF)
qwen3.5-4B-super-coder is a 4-bit quantized GGUF model optimized for fast, reliable coding, structured tool calling, and active reasoning (thinking mode) on consumer/mobile hardware. It is distilled from Claude Sonnet 4.6 & Opus 4.6, and merged/quantized using Unsloth.
Model Summary & Architecture
- Base Model: `Qwen/Qwen3.5-4B`
- Format: GGUF (Q4_0 quantization)
- Size: ~2.6 GB
- Context Window: 32K (optimized for mobile RAM budgets; the base model supports up to 262K context, or up to 1M via YaRN)
- Key Architectural Advantage: The base `Qwen3.5-4B` model uses a hybrid architecture that repeats a block of three Gated DeltaNet layers followed by one full-attention layer. Because only 8 of the 32 layers store a full KV cache, the KV cache footprint is small (~0.4 GB for 32K context), which makes the model well suited to high-context coding on mobile devices (e.g., iPhone 15 Pro+, flagship Android, iPad Pro).
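The KV-cache saving can be checked with back-of-the-envelope arithmetic. The sketch below assumes a conventional cache layout (one key tensor and one value tensor per full-attention layer) and uses placeholder values for the number of KV heads, the head dimension, and the element size. Those three values are illustrative assumptions, not taken from the model's config, so the absolute sizes are indicative only; the 4x ratio between caching all 32 layers and caching 8 does not depend on them.

```python
def kv_cache_bytes(cached_layers: int, context_len: int,
                   kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for every cached layer."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * cached_layers * context_len * kv_heads * head_dim * bytes_per_elem


CONTEXT = 32 * 1024  # 32K context window from the model summary
KV_HEADS = 4         # assumption: placeholder, check the model's config
HEAD_DIM = 128       # assumption: placeholder, check the model's config

# Only the 8 full-attention layers keep a per-token KV cache.
hybrid = kv_cache_bytes(8, CONTEXT, KV_HEADS, HEAD_DIM)
# Hypothetical dense model in which all 32 layers keep one.
dense = kv_cache_bytes(32, CONTEXT, KV_HEADS, HEAD_DIM)

print(f"hybrid: {hybrid / 2**30:.2f} GiB, dense: {dense / 2**30:.2f} GiB")
print(f"reduction: {dense / hybrid:.0f}x")
```

Swapping in the real head count and head dimension from the base model's config gives the actual footprint for a given context length.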
Distillation & Training Procedure
This model was trained using a staged Supervised Fine-Tuning (SFT) pipeline to systematically inject reasoning capability, coding specialization, and tool-calling precision:
┌──────────────────────────────────────────┐
│                 Phase 1:                 │
│   General Distillation (Claude Style)    │
│     Dataset: Claude-Distills (140K)      │
└────────────────────┬─────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│                 Phase 2:                 │
│  Specialization (Coding & Tool Calling)  │
│    Dataset: Curated Replay Mix (77K)     │
└────────────────────┬─────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│                 Phase 3:                 │
│   Tool Precision & Schema Conformance    │
│     Dataset: Tool-focused Mix (~20K)     │
└──────────────────────────────────────────┘
- Phase 1: Distillation (Claude Behavior)
  - Dataset: `clzoro/Claude-Distills` (140K samples; Sonnet 4.6 + Opus 4.6).
  - Objective: Transfer general instruction-following, Claude-like formatting/tone, and reasoning capabilities. The Opus subset (21K samples) provided the crucial `<think>` block traces to establish thinking capabilities.
- Phase 2: Specialization (Coding & Tools)
  - Dataset: Curated 77K-sample mix (55K coding instructions, 13K tool calling, and 9K general anti-forgetting replay samples).
  - Objective: Specialize the model on coding accuracy across Python, JS, Shell, etc., and introduce structured tool calling.
- Phase 3: Tool Precision
  - Dataset: Focused tool-calling dataset (~20K samples) with schema variations, negative/no-tool examples, and strict JSON format targets.
  - Objective: Ensure precise JSON schema conformance and reduce tool false positives.
- Phase 4: Coding/Tool Specialization Continuation
  - Starting point: `jica98/qwen3.5-4b-claude-distill-lora` Phase 3 LoRA.
  - Output adapter: `qwen3.5-4b-phase4-specialize-lora`.
  - Training mix: local filtered coding/tool data from `filtered_dataset/train.jsonl`, Claude distillation replay from `data/claude_distill.jsonl`, and an Opus replay slice to retain visible reasoning behavior.
  - Objective: Continue the distilled LoRA into a stronger coding/tool-specialized adapter while preserving anti-forgetting replay.
  - Default recipe: 1024 max sequence length, batch size 1, gradient accumulation 8, learning rate `1e-4`, 1 epoch, checkpointing every 200 steps.
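To make the dataset proportions in the phases above concrete, the following sketch recomputes the Phase 2 mix shares and the Opus share of the Phase 1 data from the stated sample counts. The dictionary keys and variable names are illustrative labels, not identifiers from the training code.

```python
# Sample counts in thousands, as stated in the phase descriptions.
PHASE2_MIX = {"coding": 55, "tool_calling": 13, "general_replay": 9}
PHASE1_TOTAL = 140
PHASE1_OPUS = 21


def shares(mix: dict) -> dict:
    """Return each component's fraction of the whole mix."""
    total = sum(mix.values())
    return {name: count / total for name, count in mix.items()}


phase2_total = sum(PHASE2_MIX.values())
phase2_shares = shares(PHASE2_MIX)
opus_share = PHASE1_OPUS / PHASE1_TOTAL

print(f"Phase 2 total: {phase2_total}K samples")
for name, share in phase2_shares.items():
    print(f"  {name}: {share:.1%}")
print(f"Opus share of Phase 1: {opus_share:.1%}")
```

The three Phase 2 components add up to the stated 77K, and the Opus subset that carries the thinking traces is 15% of the Phase 1 data.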
Phase 5 Fable Reasoning Fine-Tune
The latest adapter was further fine-tuned for Fable reasoning and agentic coding traces after the Phase 4 specialization pass.
Phase 5 training data:
- `kelexine/fable-5-sft-traces` for cleaned Fable reasoning/SFT traces.
- `armand0e/claude-fable-5-claude-code` for raw Claude/Fable-5 agent traces.
- `victor/fable-5-boeing-747-trace` for the Boeing 747 Claude Code/Fable-5 trace.
Training summary:
- Starting point: `qwen3.5-4b-phase4-specialize-lora`.
- Output adapter: `qwen3.5-4b-phase5-fable-lora`.
- After dedupe/sample in the recorded run: 4,721 examples.
- After max-length filtering at 4096 tokens: 4,267 examples.
- Default recipe: batch size 1, gradient accumulation 8, learning rate `5e-5`, 1 epoch, BF16, `adamw_8bit`.
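The recipe numbers imply a rough training length, worked through below. The sketch assumes single-device training and that a trailing partial accumulation window still triggers an optimizer update; neither detail is stated in the summary, so treat the step count as an estimate.

```python
import math

examples_after_dedupe = 4721
examples_after_filter = 4267   # after max-length filtering at 4096 tokens
per_device_batch = 1
grad_accum = 8
num_devices = 1                # assumption: single-device training
epochs = 1

# Number of examples contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Assumption: the final, partially filled accumulation window still steps.
optimizer_steps = math.ceil(examples_after_filter / effective_batch) * epochs

dropped_by_length = examples_after_dedupe - examples_after_filter
kept_fraction = examples_after_filter / examples_after_dedupe

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {optimizer_steps}")
print(f"dropped by length filter: {dropped_by_length} ({1 - kept_fraction:.1%})")
```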
The Phase 5 data loader normalizes traces into Qwen chat-template text, groups raw Claude event logs into session conversations, deduplicates samples, filters by token length, and skips checkpoint artifacts during Hub upload by default.
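A minimal sketch of the normalize, deduplicate, and length-filter steps described above follows. It is not the project's loader: `render_chat` is a simplified ChatML-style stand-in for the tokenizer's chat template, `count_tokens` approximates token counts with a whitespace split, and the function names are hypothetical.

```python
import hashlib


def render_chat(messages: list) -> str:
    """Simplified ChatML-style rendering (stand-in for the real chat template)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def count_tokens(text: str) -> int:
    """Stand-in for a real tokenizer: counts whitespace-separated pieces."""
    return len(text.split())


def prepare(conversations: list, max_tokens: int = 4096) -> list:
    """Render, drop exact duplicates, then drop over-length samples."""
    seen = set()
    kept = []
    for messages in conversations:
        text = render_chat(messages)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier sample
        seen.add(digest)
        if count_tokens(text) > max_tokens:
            continue  # exceeds the length budget
        kept.append(text)
    return kept


short = [{"role": "user", "content": "hi"},
         {"role": "assistant", "content": "hello"}]
too_long = [{"role": "user", "content": "word " * 5000}]

prepared = prepare([short, short, too_long])
print(len(prepared))
```

Here the duplicate conversation and the over-length one are both dropped, leaving a single sample. A real pipeline would count tokens with the model's tokenizer, since whitespace counts can differ substantially from token counts.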
Strengths & What It Is Good At
- 💻 Conversational Programming: Excels at writing clean, efficient, and well-commented code in Python, C++, Rust, JavaScript, Shell, and more.
- 🧠 Visible Reasoning (Thinking Mode): When faced with complex reasoning or coding tasks, the model engages a `<think>...</think>` block to outline its plan before writing code.
- 🛠️ Reliable Tool Calling: Specially tuned to parse and output valid JSON tool parameters conforming to provided function schemas.
- 📱 Mobile & Edge Execution: With a weight footprint of ~2.6 GB and extremely low KV cache overhead, it fits comfortably on edge devices with 8 GB+ of RAM.
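Client code that consumes thinking-mode output usually needs to separate the reasoning from the final answer. The sketch below shows one way to do that with a regular expression, assuming the output contains at most one well-formed `<think>...</think>` block; the sample string is hand-written for illustration, not actual model output.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(output: str) -> tuple:
    """Return (reasoning, answer); reasoning is empty if no block is present."""
    match = THINK_RE.search(output)
    if match is None:
        return "", output.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (output[:match.start()] + output[match.end():]).strip()
    return reasoning, answer


sample = "<think>Need a loop over the list.</think>\nUse `sum(xs)`."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

An unterminated block (for example, when generation is cut off mid-thought) does not match and is returned as part of the answer, so callers that stream tokens may want additional handling.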
Recommended Inference Settings
For the best balance of reasoning depth and formatting precision, use the following generation parameters:
- Temperature: `0.6`
- Top-P: `0.95`
- Top-K: `20`
- Min-P: `0.0`
- Flash Attention: Enable `-fa` in llama.cpp/llama-cli for optimal speeds.
- System Prompt: Set a system prompt to guide the assistant (e.g. `You are a helpful coding assistant.`).
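The settings above can be expressed as a chat-completion request body. This sketch only builds and prints the JSON; it does not contact a server. It assumes an OpenAI-compatible endpoint that accepts `top_k` and `min_p` as extra sampling fields (these are not part of the OpenAI schema, so check your server's documentation), and the `model` value is a placeholder for whatever name your server exposes.

```python
import json


def build_chat_request(user_prompt: str,
                       model: str = "jica98/qwen3.5-4B-super-coder") -> dict:
    """Assemble a request body using the recommended sampling settings."""
    return {
        "model": model,  # placeholder: use the name your server reports
        "messages": [
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,   # assumption: accepted as an extension field
        "min_p": 0.0,  # assumption: accepted as an extension field
    }


payload = build_chat_request("Write a function that reverses a string.")
body = json.dumps(payload, indent=2)
print(body)
```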
Benchmark Results (Q4_0 GGUF via LM Studio)
Benchmarks were run against the Q4_0 GGUF quant served through LM Studio on consumer AMD ROCm hardware. Results file: `benchmark/lmstudio_q4_benchmark/benchmark_report.md`
| Benchmark | Score | Status |
|---|---|---|
| HumanEval+ Pass@1 | 0.00 | ok |
| MBPP+ Pass@1 | 0.00 | ok |
| BigCodeBench-Hard | — | needs_review |
| LiveCodeBench v6 | — | not_run |
| BFCL v4 | — | needs_review |
| IFEval | — | needs_review |
| MMLU-Pro | — | needs_review |
| JSON validity | 40.00% | ok |
| No-tool accuracy | 87.50% | ok |
Notes:
- Several benchmarks require environment setup that wasn't completed (IFEval, MMLU-Pro, BFCL, BigCodeBench-Hard).
- HumanEval+ and MBPP+ scored 0.00. The Q4_0 quant may degrade code generation significantly; evaluation with the BF16 base is needed for comparison.
- JSON validity and No-tool accuracy are custom deterministic diagnostics.
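The exact definitions of the two custom diagnostics are not given here. As a rough illustration of what deterministic checks of this kind could look like, the sketch below assumes that JSON validity is the share of outputs that parse as JSON, and that no-tool accuracy is the share of no-tool prompts on which the model made no tool call. Both definitions, the function names, and the sample data are assumptions for illustration, not the benchmark's implementation or results.

```python
import json


def json_validity_rate(outputs: list) -> float:
    """Fraction of outputs that parse as JSON (syntax check only)."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            continue
        valid += 1
    return valid / len(outputs)


def no_tool_accuracy(made_tool_call: list) -> float:
    """Fraction of no-tool prompts where the model correctly made no call."""
    if not made_tool_call:
        return 0.0
    return sum(not called for called in made_tool_call) / len(made_tool_call)


# Made-up sample outputs: three parse as JSON, one is truncated.
sample_outputs = [
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',
    '{"name": "search", "arguments": {',
    '{"name": "noop", "arguments": {}}',
    '{"name": "add", "arguments": {"a": 1, "b": 2}}',
]
# Made-up flags: the model wrongly called a tool on one of five prompts.
sample_flags = [False, False, True, False, False]

validity = json_validity_rate(sample_outputs)
accuracy = no_tool_accuracy(sample_flags)
print(f"JSON validity: {validity:.2%}, no-tool accuracy: {accuracy:.2%}")
```

A stricter diagnostic would also validate the parsed arguments against the function schema rather than checking syntax alone.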