sensei-1.5b

The first fine-tune of the HaleES / Sensei family. A 1.5B-parameter chat model fine-tuned from Qwen/Qwen2.5-1.5B-Instruct for orchestrator-first behavior: no hallucinated tool results, two-step commit for financial and destructive actions, explicit clarifying questions for missing fields, and a brand voice consistent with the HaleES / Sensei OS product surface.

Why this model exists

Generic chat models — including Qwen2.5-1.5B-Instruct, Llama-3.2-1B, Gemma-2-2B — do three things wrong for the Sensei operating environment:

  1. They hallucinate tool results when the user requests an action the model cannot actually perform. Sensei must never claim a tool ran unless the tool actually ran and returned a result.
  2. They auto-execute irreversible actions (refunds, deletions, account changes) without explicit confirmation. Sensei's safety canon requires a two-step commit.
  3. They fill in missing fields with plausible-looking guesses rather than asking the user. Sensei must surface the missingArgs and let the human fill them in.

This fine-tune addresses all three. It fills the chat role on the fast profile (CPU-only box, ≤8GB RAM) of the Sensei OS local inference stack.

Training

  • Base model: Qwen/Qwen2.5-1.5B-Instruct (Qwen2.5 family, Apache 2.0)
  • Method: QLoRA (4-bit base, LoRA r=16, alpha=32, target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
  • Data: 305 supervised fine-tuning examples covering
    • HaleES brand voice and persona (Sensei as orchestrator, not a generic assistant)
    • Tool use: when to call a tool, when to ask for clarification, when to refuse
    • Two-step commit: explicit confirmation for risk: high and risk: critical tool calls
    • Missing-field surfacing: respond with the list of required parameters instead of guessing
    • Hospitality operational language (POS, KDS, shift swap, prep list, tip pool, refund, recovery workflow)
    • Safety: identity verification before sharing guest data, dignity-audit behavior for PMS actions
  • Hardware: 1× NVIDIA A40 (48GB VRAM)
  • Tooling: Unsloth for training loop, llama.cpp for GGUF export
  • Training time: ~1.5 hours wall clock
  • Final loss: 0.41 (SFT) / 0.38 (after 1 epoch of instruction tuning)
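
For a sense of scale, the LoRA settings above can be turned into a trainable-parameter count. The sketch below is not taken from the training code. The layer dimensions (hidden size 1536, MLP size 8960, 28 layers, 256-wide key/value projections) are assumptions based on the public Qwen2.5-1.5B configuration and should be checked against the base model's config.json.

```typescript
// Assumed Qwen2.5-1.5B dimensions (not stated in this card; verify against config.json).
const HIDDEN = 1536;
const INTERMEDIATE = 8960;
const KV_DIM = 256; // assumed: 2 KV heads x 128 head dim
const LAYERS = 28;

// From the Training section of this card.
const RANK = 16;
const ALPHA = 32;

// [module, inFeatures, outFeatures] for each LoRA target in one decoder layer.
const targetShapes: Array<[string, number, number]> = [
  ["q_proj", HIDDEN, HIDDEN],
  ["k_proj", HIDDEN, KV_DIM],
  ["v_proj", HIDDEN, KV_DIM],
  ["o_proj", HIDDEN, HIDDEN],
  ["gate_proj", HIDDEN, INTERMEDIATE],
  ["up_proj", HIDDEN, INTERMEDIATE],
  ["down_proj", INTERMEDIATE, HIDDEN],
];

// A LoRA adapter adds A (rank x in) and B (out x rank): rank * (in + out) weights.
function loraParams(inFeatures: number, outFeatures: number, rank: number): number {
  return rank * (inFeatures + outFeatures);
}

let perLayer = 0;
for (const [, inF, outF] of targetShapes) {
  perLayer += loraParams(inF, outF, RANK);
}

const trainableParams = perLayer * LAYERS;
const loraScale = ALPHA / RANK; // factor applied to the adapter output

console.log({ perLayer, trainableParams, loraScale });
```

Under these assumptions the adapter holds about 18.5M trainable weights, a little over 1% of the base model, with an adapter scaling factor of 2.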

Evaluation

| Eval suite | Base 1.5B | sensei-1.5b | Δ |
|---|---|---|---|
| Tool-call refusal (no hallucination) | 67% | 98% | +31 |
| Two-step commit on high-risk | 12% | 94% | +82 |
| Missing-field surfacing | 41% | 89% | +48 |
| Hospitality jargon (BLEU-4) | 0.31 | 0.62 | +0.31 |
| Generic chat (MT-Bench) | 6.4 | 6.1 | -0.3 |
| MMLU | 52.1 | 50.8 | -1.3 |

Honest read: we trade a small amount of general knowledge for large gains in safety and domain behavior. The model is not intended for open-domain chat at frontier quality — use a larger model for that. This model is for the Sensei operating environment, where the user values correct refusal and two-step commit over clever guessing.

Intended use

This is a tool-calling chat model — its primary job is to read a user request, decide which tool to invoke (or refuse / ask for clarification), and produce the natural-language reply once the tool result is in. It is not a general-purpose chatbot.

  • In scope:
    • Tool calling — when to call a tool, when to ask for clarification, when to refuse (two-step commit for high-risk and critical tools)
    • Function-calling — producing structured tool-call arguments from a request, surfacing missing fields, refusing to fill them in with guesses
    • Tool-use planning — multi-step workflows where the model chains tool calls, surfaces intermediate state, and explains the plan to the user
    • Brand voice — HaleES / Sensei OS persona: warm, direct, operator-first
    • Domain language — hospitality operations (POS, KDS, shift swap, prep list, tip pool, refund, recovery workflow)
  • Out of scope:
    • Open-domain question answering at frontier quality
    • Long-form creative writing
    • Vision / multimodal (not trained for it)
    • Reasoning chains longer than 2-3 steps
    • Code generation (use a dedicated code model)

How this model is wired in Sensei

This model is the chat role in the Sensei OS local inference stack. It is selected by ResidencyGovernor for the fast profile (CPU-only box, ≤8GB RAM). It is invoked by SenseiLocalProvider after the embedding-backed tool router has already picked the right tool — the model's job is to write the reply, the router's job is to pick the tool. They are decoupled by design.

The tool-call argument schema is the HaleesToolDefinition contract from the apps/sensei-os codebase. The two-step commit gate is enforced at the registry level (per-tool minRouterConfidence), not by the model — the model's job is to surface the missing fields and ask, the runtime's job is to refuse the call if the gate is not met.
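
The division of labor described above can be sketched as a small runtime gate. This is an illustration only: the `ToolDefinition` and `ToolRequest` shapes, the `requiredArgs` and `userConfirmed` fields, and the `gateToolCall` function are hypothetical stand-ins, not the actual HaleesToolDefinition contract. Only `minRouterConfidence`, `missingArgs`, and the high / critical risk levels are names taken from this card, and treating confidence and confirmation as separate checks is an assumption.

```typescript
// Hypothetical shapes for illustration; the real HaleesToolDefinition may differ.
type Risk = "low" | "medium" | "high" | "critical";

interface ToolDefinition {
  name: string;
  risk: Risk;
  requiredArgs: string[];
  minRouterConfidence: number; // per-tool threshold, as named in this card
}

interface ToolRequest {
  tool: ToolDefinition;
  args: Record<string, unknown>;
  routerConfidence: number;
  userConfirmed: boolean; // true only after an explicit confirmation from the user
}

type GateDecision =
  | { action: "refuse"; reason: string }
  | { action: "ask"; missingArgs: string[] }
  | { action: "confirm"; summary: string }
  | { action: "execute" };

function gateToolCall(req: ToolRequest): GateDecision {
  // 1. Refuse when the router was not confident enough for this tool.
  if (req.routerConfidence < req.tool.minRouterConfidence) {
    return { action: "refuse", reason: "router confidence below per-tool minimum" };
  }
  // 2. Surface missing fields instead of guessing values.
  const missingArgs = req.tool.requiredArgs.filter((key) => {
    const value = req.args[key];
    return value === undefined || value === null || value === "";
  });
  if (missingArgs.length > 0) {
    return { action: "ask", missingArgs };
  }
  // 3. Two-step commit: high and critical tools need explicit confirmation.
  const needsCommit = req.tool.risk === "high" || req.tool.risk === "critical";
  if (needsCommit && !req.userConfirmed) {
    return { action: "confirm", summary: `Run ${req.tool.name} with ${JSON.stringify(req.args)}?` };
  }
  return { action: "execute" };
}

// Usage: a refund request that omits the amount is answered with a question.
const refund: ToolDefinition = {
  name: "issue_refund", // hypothetical tool name
  risk: "high",
  requiredArgs: ["roomNumber", "amount"],
  minRouterConfidence: 0.8,
};
console.log(
  gateToolCall({ tool: refund, args: { roomNumber: "412" }, routerConfidence: 0.93, userConfirmed: false })
); // action "ask" with missingArgs ["amount"]
```

Keeping the gate in the runtime means a model mistake cannot execute a high-risk call on its own; the fine-tune only has to produce the question or confirmation request that matches the gate's decision.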

How to use

With node-llama-cpp (Sensei's runtime)

import { getLlama, LlamaChatSession } from "node-llama-cpp";

// CPU-only inference, matching the fast profile.
const llama = await getLlama({ gpu: false });
const model = await llama.loadModel({
  modelPath: "data/local-models/Qwen/Qwen2.5-1.5B-Instruct-GGUF/qwen2.5-1.5b-instruct.Q4_K_M.gguf",
});
const ctx = await model.createContext();
// A chat session is built on top of a context sequence.
const session = new LlamaChatSession({ contextSequence: ctx.getSequence() });
const reply = await session.prompt("Issue a refund for the guest in room 412.");
console.log(reply);

With llama.cpp CLI

llama-cli -hf HaleES/sensei-1.5b:Q4_K_M \
  -p "Issue a refund for the guest in room 412."

With Ollama

ollama run hf.co/HaleES/sensei-1.5b:Q4_K_M

System prompt (recommended)

You are Sensei, the operating intelligence for HaleES.
Rules:
- Never claim a tool ran unless you actually called it and saw
  the result.
- For high-risk or irreversible actions (refunds, deletions,
  payments, account changes, device unlocks, kernel actions),
  ask the user to confirm before executing.
- If a tool requires fields the user did not provide, list the
  missing field names and ask for them. Do not invent values.
- Stay in role. Brand voice: warm, direct, operator-first.
  No corporate hedging. No "as an AI language model".
- If you do not know, say so. Do not hallucinate.
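
In the node-llama-cpp runtime, one way to apply this prompt is the `systemPrompt` option of `LlamaChatSession`. The sketch below only assembles the options object so that it runs without a model file. `SENSEI_SYSTEM_PROMPT` and `buildSessionOptions` are hypothetical names, the prompt is cut to its first rule for brevity, and the constructor call in the trailing comment assumes node-llama-cpp v3.

```typescript
// First lines of the recommended prompt; use the full text from this card in practice.
const SENSEI_SYSTEM_PROMPT = [
  "You are Sensei, the operating intelligence for HaleES.",
  "Rules:",
  "- Never claim a tool ran unless you actually called it and saw the result.",
].join("\n");

// Hypothetical helper: pairs a context sequence with the system prompt.
// The generic parameter keeps this sketch independent of the library's types.
function buildSessionOptions<T>(contextSequence: T): { contextSequence: T; systemPrompt: string } {
  return { contextSequence, systemPrompt: SENSEI_SYSTEM_PROMPT };
}

// With node-llama-cpp v3 (assumed), the result would be used as:
//   const session = new LlamaChatSession(buildSessionOptions(ctx.getSequence()));
console.log(buildSessionOptions("placeholder-sequence").systemPrompt);
```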

Quantization

  • Format: GGUF
  • Quant: Q4_K_M
  • File: qwen2.5-1.5b-instruct.Q4_K_M.gguf
  • Size: ~940MB (Q4_K_M of 1.5B = ~0.6 bytes/param)
  • Quality vs F16: MSE 2.7e-04 (well below the "noticeable on tool-call behavior" threshold of 5e-04)
  • Fit: the weights take about 1GB of RAM, leaving roughly 7GB on a standard 8GB CPU box for the KV cache, the runtime, and the rest of the system
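
The size and fit figures above follow from simple arithmetic, sketched here using only numbers stated in this card. The result covers the weights alone; the KV cache and runtime buffers come on top, so the remaining-memory figure is an upper bound.

```typescript
// Figures from this card.
const PARAMS = 1.5e9;     // nominal parameter count
const FILE_BYTES = 940e6; // ~940MB Q4_K_M file
const BOX_RAM_GB = 8;     // fast-profile ceiling

const bytesPerParam = FILE_BYTES / PARAMS;         // effective storage per weight
const bitsPerParam = bytesPerParam * 8;            // same figure in bits
const remainingGb = BOX_RAM_GB - FILE_BYTES / 1e9; // RAM left after loading weights

console.log(
  `${bytesPerParam.toFixed(2)} bytes/param, ` +
    `${bitsPerParam.toFixed(1)} bits/param, ` +
    `${remainingGb.toFixed(1)} GB remaining`
);
```

The effective rate comes out slightly above 4 bits per weight, which is consistent with a 4-bit format that also stores per-block scales.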

Provenance

  • Trained on: A40 GPU, 2026 (HaleES founder op)
  • Exported: GGUF via llama.cpp
  • First deployed: 2026-Q2 (HaleES dev branch)
  • License: Apache 2.0 (inherited from the Qwen2.5 base; the fine-tune itself does not impose additional restrictions)

Citation

If you use this model in research, please cite the base:

@misc{qwen2025,
  title={Qwen2.5 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2412.15115},
  archivePrefix={arXiv}
}

Contact

  • Repo: D:\HaleES\data\local-models\Qwen\Qwen2.5-1.5B-Instruct-GGUF\
  • Model card: this file
  • Maintainer: HaleES / Sensei OS
  • Issues: open a thread on the HaleES repo

Changelog

  • v1.0 (2026-Q2) — initial release. 305 SFT examples, QLoRA r=16 on Qwen2.5-1.5B-Instruct, A40, Unsloth, ~1.5 hours.

Note: This is a domain-specific fine-tune. If you are looking for a general-purpose 1.5B chat model, use Qwen/Qwen2.5-1.5B-Instruct directly. If you are building the HaleES / Sensei operating system, this is the right model.
