Upload folder using huggingface_hub

e65fb72 verified 8 days ago

3.6 kB

	---
	license: apache-2.0
	base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
	tags:
	- code
	- function-calling
	- tool-use
	- agent
	- small-language-model
	datasets:
	- NousResearch/hermes-function-calling-v1
	language:
	- en
	pipeline_tag: text-generation
	---

	# smolcode-coder-1.5b-tools

	A LoRA fine-tune of Qwen2.5-Coder-1.5B-Instruct that teaches the model to emit
	native `<tool_call>` function calls, so a 1.5B coder model can actually drive an
	agentic write → run → fix → verify loop.

	Built for [smolcode](https://gitea.poyner.ai/sean/smolcode) — an SLM-optimized
	agentic coding assistant — for the Hugging Face Build Small hackathon.

	## Why
	Out of the box, small Qwen-Coder models describe tool calls as plain-text/```json
	instead of emitting the native `<tool_call>` token (id 151657) that runtimes (Ollama,
	llama.cpp) parse into OpenAI-style `tool_calls` — which breaks agentic loops. This
	fine-tune closes that gap on a tiny (1.5B) model: 100% native `<tool_call>` emission
	in free generation on held-out prompts (base model: 0%).

	## Results
	- Native tool-call rate: 100% (16/16 held-out prompts) — the release gate.
	- Agentic bench (smolcode pass@1, 10 tasks): 9/10 as the entry tier of a
	1.5B→8B→30B ladder, solving 7/10 entirely on its own (2–16s each). For
	comparison the all-Granite ladder (3B entry) scores 10/10 — the 1.5B carries the
	same standalone load as a 2×-larger 3B.
	- Train loss: 0.138 (3 epochs, assistant-only loss).

	## Training
	- Base: Qwen/Qwen2.5-Coder-1.5B-Instruct
	- Method: bf16 LoRA (r=16, α=32) on attention + MLP projections, **plus full
	training of `embed_tokens` + `lm_head`** (`modules_to_save`) — required so the model
	can output the `<tool_call>` special token, which LoRA on attention/MLP alone
	cannot. Assistant-only loss (loss on tool calls + final answers only).
	- Data: NousResearch/hermes-function-calling-v1 (breadth) + synthetic smolcode
	tool-use trajectories (sharpness), all rendered through the same
	`apply_chat_template(tools=...)` used at inference — training target is byte-identical
	to the served prompt (fixes the v1 train/inference template mismatch).
	- Schedule: 3 epochs, full 2048 sequence length. Trained on Modal (A100).

	## Serving — read this, two non-obvious requirements
	1. Serve via the GGUF, not the safetensors directly. Ollama's bf16-safetensors
	auto-import produces garbage (`??????`) for this model. Use the included
	`smolcode-1.5b-q4_k_m.gguf` (converted with llama.cpp `convert_hf_to_gguf.py`):
	```bash
	ollama create smolcode-coder-1.5b:tools -f Modelfile # Modelfile is in this repo
	```
	2. `repeat_penalty` / `repetition_penalty` MUST be 1.0. The tool system prompt
	literally contains the `<tool_call>` token, so any penalty > 1 suppresses the model
	from emitting it (you'll see a stray token + bare JSON instead). The included
	`Modelfile` sets `PARAMETER repeat_penalty 1.0`. For raw `transformers.generate`,
	pass `repetition_penalty=1.0`.

	With those, Ollama's `/v1/chat/completions` returns proper native `tool_calls`.

	## Use (transformers)
	Standard Qwen2.5 chat template with `tools=`; greedy, `repetition_penalty=1.0`. The
	model responds with `<tool_call>{"name": ..., "arguments": ...}</tool_call>`.

	## Files
	- `model.safetensors` + tokenizer/config — the merged model (lm_head untied).
	- `smolcode-1.5b-q4_k_m.gguf` — quantized GGUF for serving.
	- `Modelfile` — Ollama import recipe (template + `repeat_penalty 1.0`).

	## License
	Apache-2.0 (inherits from the base model).