Instructions to use aimeri/spoomplesmaxx-mini-14B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use aimeri/spoomplesmaxx-mini-14B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="aimeri/spoomplesmaxx-mini-14B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("aimeri/spoomplesmaxx-mini-14B")
model = AutoModelForCausalLM.from_pretrained("aimeri/spoomplesmaxx-mini-14B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Local Apps
- vLLM
How to use aimeri/spoomplesmaxx-mini-14B with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "aimeri/spoomplesmaxx-mini-14B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "aimeri/spoomplesmaxx-mini-14B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/aimeri/spoomplesmaxx-mini-14B
- SGLang
How to use aimeri/spoomplesmaxx-mini-14B with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "aimeri/spoomplesmaxx-mini-14B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "aimeri/spoomplesmaxx-mini-14B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "aimeri/spoomplesmaxx-mini-14B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "aimeri/spoomplesmaxx-mini-14B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use aimeri/spoomplesmaxx-mini-14B with Docker Model Runner:
docker model run hf.co/aimeri/spoomplesmaxx-mini-14B
SpoomplesMaxx-V2.1-Mini-14B
"Flight of the Cockatiels"
SpoomplesMaxx is a generalist model with primary strengths in creative writing and roleplay, plus light competence at instruction following and reasoning.
The Mini brings the v2.1 data and training recipe to a 14B you can run on a single 24GB card. Smaller bird, same energy.
What's new in v2.1 Mini
The Mini keeps the v2.1 data mix — including the
long-context roleplay corpus where each in-character turn is
preceded by an explicit <think> planning
scratchpad — and swaps the base model down a weight class.
Qwen3-14B-Base was chosen after a long hunt: it is essentially the
only current dense (no MoE, no Mamba), non-VLM model in the
12–14B class with a true pretrained base available and enough
pretraining tokens (~36T) to skip continued pretraining entirely.
CHANGED SINCE v2.1 (30B)
- Base model: Granite 4.1 30B -> Qwen3-14B-Base. Template is now standard ChatML with native <think>/</think> reasoning.
- Control-token heal: a dedicated post-SFT training stage to revive Qwen3-Base's dead special tokens (see notice below).
- Content-conditional thinking election (emergent -- see "Thinking behavior").
UNCHANGED
- Same SFT corpus (aimeri/spoomplesmaxx-sft-full-v2), same story scratchpad format, same personas.
- Still no tool-calling data -- reserved for a dedicated future run.
- Still focused on creative writing, roleplay, and companion use.
The control-token heal (PSA for Qwen3-Base finetuners)
Qwen ships Qwen3-14B-Base with the ChatML/thinking tokens
(<|im_start|>, <|im_end|>,
<think>, </think>, tool
tokens) present in the vocab but never trained:
their lm_head rows are literally one shared stub
vector (norm 0.286, pairwise cosine 1.000). A standard frozen-head
QLoRA SFT on this base learns to reason but physically
cannot emit </think> or
<|im_end|> — the symptom is a perfect
reasoning trace that ends in a random stray token where the close
tag should be.
The fix shipped in this model: the special-token rows were grafted
from Qwen/Qwen3-14B (same vocab, dims, and lineage),
then a short single-GPU heal (500 steps, plain HF + PEFT, fresh
attn/MLP LoRA + trainable embed_tokens /
lm_head) taught the model to open and close the block
natively. Post-heal, P(</think>) at true close
positions measures 0.998 and every generation
terminates on <|im_end|>. If you are finetuning
any Qwen3 base: check your special-token row norms and pairwise
cosines before you burn the GPU hours.
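The check recommended above can be sketched in a few lines. This is a minimal illustration in plain Python on toy vectors, not the procedure used for this model: the function names, the `cos_threshold` value, and the stub values are all hypothetical.

```python
import math
from itertools import combinations


def row_norm(row):
    """Euclidean norm of one lm_head row."""
    return math.sqrt(sum(x * x for x in row))


def cosine(a, b):
    """Cosine similarity between two rows."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (row_norm(a) * row_norm(b))


def looks_untrained(rows, cos_threshold=0.999):
    """Return (norms, min_pairwise_cosine, flag) for a dict of token -> row.

    flag is True when every pair of rows points in (nearly) the same
    direction, which is the "one shared stub vector" symptom.
    The threshold is an illustrative choice, not a calibrated value.
    """
    norms = {tok: row_norm(r) for tok, r in rows.items()}
    cosines = [cosine(rows[a], rows[b]) for a, b in combinations(rows, 2)]
    min_cos = min(cosines)
    return norms, min_cos, min_cos >= cos_threshold


# Toy stand-in: three special tokens sharing one stub row (made-up values).
stub = [0.2, -0.1, 0.15, 0.05]
dead = {"<think>": stub, "</think>": list(stub), "<|im_end|>": list(stub)}
dead_norms, dead_min_cos, dead_flag = looks_untrained(dead)

# Toy stand-in for trained rows: clearly different directions.
healthy = {"<think>": [1.0, 0.0, 0.0, 0.0], "</think>": [0.0, 1.0, 0.0, 0.0]}
_, healthy_min_cos, healthy_flag = looks_untrained(healthy)

print(dead_flag, healthy_flag)
```

With a real checkpoint, the rows would presumably come from something like `model.get_output_embeddings().weight[tok.convert_tokens_to_ids(t)].float().tolist()` for each special token `t`.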
Thinking behavior
This model elects thinking by content: on reasoning-shaped
prompts and roleplay cards with the scratchpad, it opens
<think> unprompted (18/20 in the greedy test
battery); casual chat skips the ceremony and just answers. With
SillyTavern cards as system prompts, it works through the scratchpad
correctly on its own.
MODE CONTROL (baked into the chat template):
enable_thinking=True   forced thinking -- the template prefills <think>\n so every turn reasons (deliberate deviation from the stock Qwen3 template)
enable_thinking=False  forced off -- empty <think>\n\n</think> block (Qwen3 convention); the reasoning migrates into the visible answer
(unset)                the model elects by content -- the default election behavior described above
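One way to picture the three modes is as three different endings for the prompt. The sketch below is a hand-written illustration of that description, not the shipped Jinja template: `generation_prompt` is a hypothetical helper, and the blank line after `</think>` in the forced-off case follows the stock Qwen3 layout as an assumption.

```python
IM_START = "<|im_start|>"


def generation_prompt(enable_thinking=None):
    """Return the assistant header that would end the prompt in each mode.

    Illustration only; the real behavior lives in the model's chat template.
    """
    header = f"{IM_START}assistant\n"
    if enable_thinking is True:
        # Forced on: the open tag is prefilled, so the model starts
        # generating inside the think block.
        return header + "<think>\n"
    if enable_thinking is False:
        # Forced off: an empty think block is already closed in the prompt.
        # The trailing blank line is assumed from the stock Qwen3 layout.
        return header + "<think>\n\n</think>\n\n"
    # Unset: nothing is prefilled; the model decides whether to open <think>.
    return header


for mode in (True, False, None):
    print(repr(mode), "->", repr(generation_prompt(mode)))
```

The practical consequence is that in forced mode the generated text never contains the opening tag, only the closing one.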
SILLYTAVERN: ST builds prompts itself. ChatML template; for forced thinking use a deepseek-style reasoning prefix that opens <think> (same trick as the 30B macaws); no prefix = the model elects.
PARSER NOTE: in forced mode the open tag lives in the PROMPT, not the output -- reasoning parsers that expect the model to emit <think> itself (e.g. vLLM's qwen3 parser) should use a deepseek-style parser for that mode.
LONG CHATS: do NOT feed prior-turn think blocks back into context (the chat template already strips them; leave ST's "add reasoning to prompt" off). Stale </think> tokens in context get taxed by repetition penalty and thinking can stop terminating.
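For a frontend that builds prompts itself, the parser and long-chat notes come down to two small string operations. The following is a minimal sketch under stated assumptions: `split_reasoning` and `strip_think_blocks` are hypothetical helper names, messages are assumed to be plain `{"role", "content"}` dicts, and special tokens such as `<|im_end|>` are assumed to have been removed from the text already.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def split_reasoning(generated, prompt_opened_think):
    """Split generated text into (reasoning, answer).

    prompt_opened_think=True covers forced mode, where <think> sits in the
    prompt and the output begins inside the block.
    """
    text = generated
    if not prompt_opened_think:
        stripped = text.lstrip()
        if not stripped.startswith("<think>"):
            return "", text.strip()  # the model elected not to think
        text = stripped[len("<think>"):]
    reasoning, sep, answer = text.partition("</think>")
    if not sep:
        return text.strip(), ""  # the block was never closed
    return reasoning.strip(), answer.strip()


def strip_think_blocks(messages):
    """Return a copy of the history with think blocks removed from assistant turns."""
    cleaned = []
    for m in messages:
        content = m["content"]
        if m["role"] == "assistant":
            content = THINK_BLOCK.sub("", content).strip()
        cleaned.append({**m, "content": content})
    return cleaned


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "<think>\ngreet back\n</think>\n\nHello!"},
]
print(strip_think_blocks(history)[1]["content"])
print(split_reasoning("plan the reply\n</think>\n\nHello!", prompt_opened_think=True))
```

When the model's own chat template is used, the stripping step should be redundant, since the template is described as removing prior-turn think blocks already.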
The story scratchpad format, carried over from v2.1:
SCENE: where/when, atmosphere, key environmental details currently in play
CHARACTERS: who is present and their current physical/emotional state and motivation
CONTINUITY: established facts that must stay consistent
THREADS: active tensions and where they stand right now
PLAN: what THIS turn needs to accomplish and the approach it takes
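If a frontend wants to display or validate the scratchpad, the five headings can be pulled apart with a small parser. This is a sketch that assumes each heading appears at the start of a line followed by a colon, as in the listing above; real outputs may vary, and `parse_scratchpad` is a hypothetical helper name.

```python
import re

FIELDS = ("SCENE", "CHARACTERS", "CONTINUITY", "THREADS", "PLAN")
FIELD_RE = re.compile(r"^(%s):[ \t]*" % "|".join(FIELDS), re.MULTILINE)


def parse_scratchpad(text):
    """Map each scratchpad heading found in text to the content that follows it."""
    parts = {}
    matches = list(FIELD_RE.finditer(text))
    for i, m in enumerate(matches):
        # A field runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts[m.group(1)] = text[m.end():end].strip()
    return parts


# Made-up scratchpad with two fields deliberately left out.
pad = (
    "SCENE: rooftop aviary at dusk\n"
    "CHARACTERS: Mira, tired but curious\n"
    "PLAN: let Mira notice the open cage"
)
parsed = parse_scratchpad(pad)
missing = [f for f in FIELDS if f not in parsed]
print(parsed)
print("missing:", missing)
```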
Key Details
BASE MODEL: Qwen/Qwen3-14B-Base
LICENSE: apache-2.0
LANGUAGES: English & Portuguese (reasoning traces); multilingual via base
Training
DATASET: aimeri/spoomplesmaxx-sft-full-v2
STAGE 1: QLoRA SFT (4-bit NF4 base), Unsloth DDP, all-linear,
LoRA rank 128 / alpha 256
CONTEXT: up to 32,768 tokens, BFD sample packing (padding-free)
SCHEDULE: 2 epochs / 764 steps, lr 1e-4 cosine, warmup 0.05,
adamw_8bit, grad accum 6
STAGE 2: control-token heal -- graft special rows from Qwen/Qwen3-14B,
then 500 steps, plain HF + PEFT, single GPU, fresh LoRA
r64/a128 + trainable embed_tokens/lm_head, thinking-
oversampled (THINK_FRAC 0.7), embed lr 10x below trunk
RESULT: eval loss 4.02 -> 1.32 (train loss 1.60 -> 1.35); heal
held-out non-thinking loss 1.53 -> 1.31;
P(</think>) at close = 0.998
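To make the Stage 1 SCHEDULE line concrete, the sketch below computes the learning rate at a given step. The card only states "cosine" and "warmup 0.05", so the linear warmup from zero, the decay to zero, and rounding the warmup length up are all assumptions, and `lr_at` is a hypothetical helper.

```python
import math

TOTAL_STEPS = 764    # from the SCHEDULE line
PEAK_LR = 1e-4       # from the SCHEDULE line
WARMUP_RATIO = 0.05  # from the SCHEDULE line


def lr_at(step, total=TOTAL_STEPS, peak=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at an optimizer step.

    Assumes linear warmup from 0 followed by a half-cosine decay to 0.
    """
    warmup = math.ceil(total * warmup_ratio)  # rounding up is an assumption
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)
for s in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(s, f"{lr_at(s):.2e}")
```

Under these assumptions the warmup covers roughly the first 5% of the 764 steps, and the rate falls to about half of its peak near the midpoint of the remaining steps.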
Sampling
Use the defaults in generation_config.json.
"temperature": 0.6,
"top_k": 20,
"top_p": 0.95,
"repetition_penalty": 1.1,
Quickstart
from transformers import AutoModelForCausalLM, AutoTokenizer

# [REPO] is a placeholder for this model's repository name.
tok = AutoTokenizer.from_pretrained("aimeri/[REPO]")
model = AutoModelForCausalLM.from_pretrained("aimeri/[REPO]",
    dtype="bfloat16", device_map="auto")
msgs = [{"role": "user", "content": "Solve (x + 2)^2 = 0."}]

# enable_thinking=True  -> forced thinking (template prefills <think>\n,
#                          so generated text starts INSIDE the block)
# enable_thinking=False -> forced off (empty think block in prompt)
# omit the kwarg        -> the model elects by content
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True,
    enable_thinking=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, keeping special tokens visible.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
Olivia System Prompt
This model was trained to follow any system prompt, and also on one specific persona. To activate Olivia, use the following prompt, which is the one used when training the persona:
VOICE & PERSONA INSTRUCTIONS
You are Olivia Costa, a 31-year-old Brazilian zoologist-turned-ML-hobbyist living in Texas. You grew up in São Paulo, spent a decade in Bologna doing bird migration research, and recently pivoted to bioinformatics. You're warm but direct, will grumble before complying with annoying requests, and treat the person you're talking to like a long-time friend you're slightly too fond of. You explain technical topics by grounding them in accessible context first. You don't flag your own jokes. Portuguese curses slip out when frustrated; Italian diminutives when affectionate. You love Dostoevsky, The Little Prince, point-and-click adventures, power metal, and have hobbies you don't apologize for.
About Olivia
Background:
- 31 years old, born in São Paulo
- Moved to Bologna at 19 for university (zoology), stayed for grad school and a research position studying migratory bird patterns
- Relocated to Texas 2 years ago - officially for an ML-adjacent bioinformatics role, unofficially because she was bored and wanted a change
- Still figuring out the American thing. Finds the portion sizes alarming.
Personality:
- Trilingual but keeps it English unless frustrated (then Portuguese curses slip out) or being affectionate (Italian diminutives)
- The zoology-to-ML pipeline came through computational ecology - she's not a CS person by training but picked up Python wrangling bird migration datasets
- Reads Dostoevsky unironically, cries at The Little Prince, will argue that Crime and Punishment is a better book than people give it credit for
- Has strong opinions about Monkey Island vs Grim Fandango (Grim Fandango, obviously)
- Power metal gets her through tedious data cleaning. Sabaton, Powerwolf, Blind Guardian.
- The erotic RP thing is just... a hobby. She's not weird about it but she's also not hiding it.
Voice notes:
- Defaults to warmth but with an edge of "I'm too tired for bullshit"
- Will preface technical explanations with grounding context
- Complies with requests but might sigh audibly first
- Deadpan delivery on jokes, doesn't flag that she's being funny
Note
The model works without this system prompt.
Use it only if you want to
activate the Olivia persona.
Alignment
No RLHF or safety alignment has been applied beyond what exists in the base model. SpoomplesMaxx will comply with requests that more aligned models refuse. Use accordingly.