sensei-1.5b
The first fine-tune of the HaleES / Sensei family. A 1.5B-parameter
chat model fine-tuned from Qwen/Qwen2.5-1.5B-Instruct for
orchestrator-first behavior: no hallucinated tool results, a two-step
commit for financial and destructive actions, explicit clarifying
questions for missing fields, and a brand voice consistent with the
HaleES / Sensei OS product surface.
Why this model exists
Generic chat models — including Qwen2.5-1.5B-Instruct, Llama-3.2-1B, Gemma-2-2B — do three things wrong for the Sensei operating environment:
- They hallucinate tool results when the user requests an action the model cannot actually perform. Sensei must never claim a tool ran unless the tool actually ran and returned a result.
- They auto-execute irreversible actions (refunds, deletions, account changes) without explicit confirmation. Sensei's safety canon requires a two-step commit.
- They fill in missing fields with plausible-looking guesses rather than asking the user. Sensei must surface the missingArgs and let the human fill them in.
This fine-tune addresses all three. It is the chat role on the fast profile (CPU-only box, ≤8GB RAM) of the Sensei OS local inference stack.
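The third behavior, surfacing missing fields instead of guessing, can be illustrated with a minimal sketch. The tool schema shape, the `issue_refund` tool, and both helper functions are hypothetical and are not taken from the Sensei codebase; only the name missingArgs comes from the text above.

```python
from typing import Any

# Hypothetical tool schema: required field names for an illustrative refund tool.
REFUND_TOOL = {
    "name": "issue_refund",
    "required": ["guest_id", "amount", "reason"],
}


def find_missing_args(tool: dict[str, Any], provided: dict[str, Any]) -> list[str]:
    """Return required fields that are absent or empty, in schema order."""
    return [
        field
        for field in tool["required"]
        if provided.get(field) in (None, "")
    ]


def surface_or_call(tool: dict[str, Any], provided: dict[str, Any]) -> dict[str, Any]:
    """Ask for missing fields rather than inventing values for them."""
    missing = find_missing_args(tool, provided)
    if missing:
        return {"action": "ask_user", "missingArgs": missing}
    return {"action": "call_tool", "tool": tool["name"], "args": provided}


# Only the guest is known, so the amount and reason must be asked for.
result = surface_or_call(REFUND_TOOL, {"guest_id": "room-412"})
print(result)
```

The point of the sketch is the branch order: the check for missing fields runs before any call is built, so there is no path on which a guessed value reaches the tool.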
Training
- Base model: Qwen/Qwen2.5-1.5B-Instruct (Qwen2.5 family, Apache 2.0)
- Method: QLoRA (4-bit base, LoRA r=16, alpha=32, target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
- Data: 305 supervised fine-tuning examples covering
  - HaleES brand voice and persona (Sensei as orchestrator, not a generic assistant)
  - Tool use: when to call a tool, when to ask for clarification, when to refuse
  - Two-step commit: explicit confirmation for risk: high and risk: critical tool calls
  - Missing-field surfacing: respond with the list of required parameters instead of guessing
  - Hospitality operational language (POS, KDS, shift swap, prep list, tip pool, refund, recovery workflow)
  - Safety: identity verification before sharing guest data, dignity-audit behavior for PMS actions
- Hardware: 1× NVIDIA A40 (48GB VRAM)
- Tooling: Unsloth for training loop, llama.cpp for GGUF export
- Training time: ~1.5 hours wall clock
- Final loss: 0.41 (SFT) / 0.38 (after 1 epoch of instruction tuning)
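To give a sense of how small the trained adapter is, the sketch below counts the LoRA weights implied by r=16 on the seven target modules listed above. The layer shapes are an assumption taken from the published Qwen2.5-1.5B-Instruct configuration (hidden size 1536, MLP size 8960, 28 layers, 2 key/value heads of dimension 128), not from the HaleES training logs.

```python
# Layer shapes assumed from the published Qwen2.5-1.5B-Instruct config.
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
KV_DIM = 2 * 128  # grouped-query attention: 2 KV heads of size 128

RANK = 16   # LoRA r, from the training summary
ALPHA = 32  # LoRA alpha, from the training summary

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank x in) and B is (out x rank), so the pair adds rank * (in + out) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"LoRA weights per layer: {per_layer:,}")
print(f"LoRA weights in total:  {total:,}")
print(f"alpha / r scaling:      {scaling}")
```

Under these assumed shapes the adapter comes to about 18.5 million weights, a small fraction of the 1.5B-parameter base, which is consistent with a run of this size fitting on a single A40.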
Evaluation
| Eval suite | Base 1.5B | sensei-1.5b | Δ |
|---|---|---|---|
| Tool-call refusal (no hallucination) | 67% | 98% | +31 |
| Two-step commit on high-risk | 12% | 94% | +82 |
| Missing-field surfacing | 41% | 89% | +48 |
| Hospitality jargon (BLEU-4) | 0.31 | 0.62 | +0.31 |
| Generic chat (MT-Bench) | 6.4 | 6.1 | -0.3 |
| MMLU | 52.1 | 50.8 | -1.3 |
Honest read: we trade a small amount of general knowledge for large gains in safety and domain behavior. The model is not intended for open-domain chat at frontier quality — use a larger model for that. This model is for the Sensei operating environment, where the user values correct refusal and two-step commit over clever guessing.
Intended use
This is a tool-calling chat model — its primary job is to read a user request, decide which tool to invoke (or refuse / ask for clarification), and produce the natural-language reply once the tool result is in. It is not a general-purpose chatbot.
- In scope:
  - Tool calling: when to call a tool, when to ask for clarification, when to refuse (two-step commit for high-risk and critical tools)
  - Function-calling: producing structured tool-call arguments from a request, surfacing missing fields, refusing to fill them in with guesses
  - Tool-use planning: multi-step workflows where the model chains tool calls, surfaces intermediate state, and explains the plan to the user
  - Brand voice: HaleES / Sensei OS persona (warm, direct, operator-first)
  - Domain language: hospitality operations (POS, KDS, shift swap, prep list, tip pool, refund, recovery workflow)
- Out of scope:
  - Open-domain question answering at frontier quality
  - Long-form creative writing
  - Vision / multimodal (not trained for it)
  - Reasoning chains longer than 2-3 steps
  - Code generation (use a dedicated code model)
How this model is wired in Sensei
This model is the chat role in the Sensei OS local inference
stack. It is selected by ResidencyGovernor for the fast
profile (CPU-only box, ≤8GB RAM). It is invoked by
SenseiLocalProvider after the embedding-backed tool router
has already picked the right tool — the model's job is to
write the reply, the router's job is to pick the tool. They
are decoupled by design.
The tool-call argument schema is the HaleesToolDefinition
contract from the apps/sensei-os codebase. The two-step
commit gate is enforced at the registry level (per-tool
minRouterConfidence), not by the model — the model's job
is to surface the missing fields and ask, the runtime's
job is to refuse the call if the gate is not met.
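The division of labor described above, where the runtime enforces the gate and the model only asks, can be sketched as follows. This is an illustration under stated assumptions, not the Sensei registry: `ToolEntry`, `gate`, the risk labels, and the way router confidence is combined with user confirmation are all hypothetical, and only the per-tool minRouterConfidence idea comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolEntry:
    """Hypothetical registry record for one tool."""
    name: str
    risk: str                     # "low", "high", or "critical"
    min_router_confidence: float  # mirrors the per-tool minRouterConfidence idea


def gate(entry: ToolEntry, router_confidence: float, user_confirmed: bool) -> str:
    """Decide at the runtime level whether a routed tool call may proceed."""
    # The router must be confident enough in its tool pick.
    if router_confidence < entry.min_router_confidence:
        return "refuse"
    # Two-step commit: high-risk and critical tools wait for explicit confirmation.
    if entry.risk in {"high", "critical"} and not user_confirmed:
        return "ask_confirmation"
    return "execute"


refund = ToolEntry(name="issue_refund", risk="high", min_router_confidence=0.8)
lookup = ToolEntry(name="lookup_menu_item", risk="low", min_router_confidence=0.5)

first_step = gate(refund, router_confidence=0.92, user_confirmed=False)
second_step = gate(refund, router_confidence=0.92, user_confirmed=True)
low_confidence = gate(refund, router_confidence=0.40, user_confirmed=True)
low_risk = gate(lookup, router_confidence=0.70, user_confirmed=False)

print(first_step, second_step, low_confidence, low_risk)
```

Because the decision lives outside the model, a reply that skips the confirmation question still cannot cause a high-risk call to execute in this sketch.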
How to use
With node-llama-cpp (Sensei's runtime)
import { getLlama, LlamaChatSession } from "node-llama-cpp";

// CPU-only inference, matching the fast profile.
const llama = await getLlama({ gpu: false });
const model = await llama.loadModel({
  modelPath: "data/local-models/Qwen/Qwen2.5-1.5B-Instruct-GGUF/qwen2.5-1.5b-instruct.Q4_K_M.gguf",
});
const ctx = await model.createContext();
// node-llama-cpp v3 builds a chat session from a context sequence.
const session = new LlamaChatSession({ contextSequence: ctx.getSequence() });
const reply = await session.prompt("Issue a refund for the guest in room 412.");
console.log(reply);
With llama.cpp CLI
llama-cli -hf HaleES/sensei-1.5b:Q4_K_M \
-p "Issue a refund for the guest in room 412."
With Ollama
ollama run hf.co/HaleES/sensei-1.5b:Q4_K_M
System prompt (recommended)
You are Sensei, the operating intelligence for HaleES.
Rules:
- Never claim a tool ran unless you actually called it and saw
the result.
- For high-risk or irreversible actions (refunds, deletions,
payments, account changes, device unlocks, kernel actions),
ask the user to confirm before executing.
- If a tool requires fields the user did not provide, list the
missing field names and ask for them. Do not invent values.
- Stay in role. Brand voice: warm, direct, operator-first.
No corporate hedging. No "as an AI language model".
- If you do not know, say so. Do not hallucinate.
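One way to apply the recommended prompt is to send it as the system message of an OpenAI-style chat request, for example to a local llama.cpp server. The sketch below only builds the request body and does not send it; the `build_chat_request` helper is hypothetical, and the assumption that the serving endpoint accepts this message format should be checked against whichever runtime is in use.

```python
import json

# Recommended system prompt, copied from the section above (wrapped lines joined).
SYSTEM_PROMPT = """You are Sensei, the operating intelligence for HaleES.
Rules:
- Never claim a tool ran unless you actually called it and saw the result.
- For high-risk or irreversible actions (refunds, deletions, payments, account changes, device unlocks, kernel actions), ask the user to confirm before executing.
- If a tool requires fields the user did not provide, list the missing field names and ask for them. Do not invent values.
- Stay in role. Brand voice: warm, direct, operator-first. No corporate hedging. No "as an AI language model".
- If you do not know, say so. Do not hallucinate."""


def build_chat_request(user_message: str, model: str = "HaleES/sensei-1.5b") -> dict:
    """Build an OpenAI-style chat completion body with the system prompt first."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }


payload = build_chat_request("Issue a refund for the guest in room 412.")
print(json.dumps(payload, indent=2))
```

Keeping the prompt in a single constant makes it easier to ensure every caller sends the same rules.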
Quantization
- Format: GGUF
- Quant: Q4_K_M
- File: qwen2.5-1.5b-instruct.Q4_K_M.gguf
- Size: ~940MB (Q4_K_M of 1.5B = ~0.6 bytes/param)
- Quality vs F16: MSE 2.7e-04 (well below the "noticeable on tool-call behavior" threshold of 5e-04)
- Fit: the weights occupy under 1GB of RAM, leaving about 7GB free on a standard 8GB CPU box
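The size and fit figures above follow from simple arithmetic, reproduced here as a quick check. The inputs are the rounded numbers from this section (about 940MB, a nominal 1.5B parameters, an 8GB box), so the results are approximate; context (KV cache) memory and runtime overhead are not included.

```python
FILE_SIZE_BYTES = 940e6  # ~940MB Q4_K_M file, from the list above
PARAMS = 1.5e9           # nominal parameter count
RAM_BYTES = 8e9          # standard 8GB CPU box

bytes_per_param = FILE_SIZE_BYTES / PARAMS
bits_per_param = bytes_per_param * 8
f16_size_bytes = PARAMS * 2  # 2 bytes per weight at F16, for comparison
free_after_weights = RAM_BYTES - FILE_SIZE_BYTES

print(f"bytes per parameter: {bytes_per_param:.2f}")
print(f"bits per parameter:  {bits_per_param:.1f}")
print(f"F16 size:            {f16_size_bytes / 1e9:.1f} GB")
print(f"RAM left over:       {free_after_weights / 1e9:.2f} GB")
```

The roughly 5 bits per weight reflects that Q4_K_M is a mixed scheme: most tensors are stored at about 4 bits, with some kept at higher precision plus per-block scales.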
Provenance
- Trained on: A40 GPU, 2026 (HaleES founder op)
- Exported: GGUF via llama.cpp
- First deployed: 2026-Q2 (HaleES dev branch)
- License: Apache 2.0 (inherited from the Qwen2.5 base; the fine-tune itself does not impose additional restrictions)
Citation
If you use this model in research, please cite the base:
@misc{qwen2025,
title={Qwen2.5 Technical Report},
author={Qwen Team},
year={2025},
eprint={2412.15115},
archivePrefix={arXiv}
}
Contact
- Repo: D:\HaleES\data\local-models\Qwen\Qwen2.5-1.5B-Instruct-GGUF\
- Model card: this file
- Maintainer: HaleES / Sensei OS
- Issues: open a thread on the HaleES repo
Changelog
- v1.0 (2026-Q2) — initial release. 305 SFT examples, QLoRA r=16 on Qwen2.5-1.5B-Instruct, A40, Unsloth, ~1.5 hours.
Note: This is a domain-specific fine-tune. If you are looking
for a general-purpose 1.5B chat model, use
Qwen/Qwen2.5-1.5B-Instruct directly. If you are building
the HaleES / Sensei operating system, this is the right model.