Datatest-460m

A 460M-parameter language model trained from scratch on a single NVIDIA RTX 2080 Ti to help with scikit-learn, matplotlib, and general ML coding tasks. Trained on ~24.6B tokens of curated code, ML papers, library docs, and Stack Overflow Q&A; instruction-tuned with SmolTalk + ~950 hand-crafted sklearn/matplotlib examples.

⚠️ Requires trust_remote_code=True. The architecture has several non-standard features (value embeddings, ReLU², custom RMSNorm) that are not in the standard transformers model registry. The repo ships its own modeling_nanochat.py that loads via the auto class.

Quick start

For best results, use self-consistency: generate N=5 candidates per query and pick the first whose first Python code block parses cleanly. This is worth roughly +14pp on a custom 18-problem sklearn benchmark vs. N=1; the small model degenerates often, but rarely on every parallel sample. Cost: ~2× wall time for ~14pp accuracy.

import ast, re, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "scottejin/Datatest-460m"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto"
).eval()

CODE_BLOCK = re.compile(r"```(?:python)?\s*\n(.*?)\n```", re.DOTALL)
def pick_best(candidates):
    """Return the first candidate whose first code block parses; else candidate 0."""
    for c in candidates:
        m = CODE_BLOCK.search(c)
        if m:
            try:
                ast.parse(m.group(1))
                return c
            except SyntaxError:
                pass
    return candidates[0]

messages = [{"role": "user", "content": "How do I tune SVM hyperparameters with GridSearchCV? Show a complete pipeline."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
# pad token == eos token for this model, so build the attention mask explicitly
attention_mask = inputs.ne(tokenizer.eos_token_id).long()
with torch.inference_mode():
    out = model.generate(
        inputs, attention_mask=attention_mask,
        max_new_tokens=512, do_sample=True,
        temperature=0.6, top_k=50,        # measured optimal with N=5
        num_return_sequences=5,           # self-consistency
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )
prompt_len = inputs.shape[1]
# strip the shared prompt from each of the 5 returned sequences
candidates = [tokenizer.decode(out[i, prompt_len:], skip_special_tokens=True) for i in range(5)]
print(pick_best(candidates))

For a single-sample / streaming version (faster, but lower quality), use num_return_sequences=1 with do_sample=True and temperature=0.6, as sketched below.
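A minimal streaming sketch using transformers' TextStreamer, reusing tokenizer, model, inputs, and attention_mask from the quick start (the streamer prints tokens as they arrive; it only supports a single returned sequence):

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
with torch.inference_mode():
    model.generate(
        inputs, attention_mask=attention_mask,
        max_new_tokens=512, do_sample=True,
        temperature=0.6, top_k=50,
        num_return_sequences=1,           # streaming requires a single sequence
        streamer=streamer,                # prints tokens as they are generated
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )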

What this model is good at

  • ✅ sklearn classification recipes: SVMs (linear/RBF/poly), Logistic Regression, Random Forest, Gradient Boosting, KNN, Naive Bayes, Decision Trees
  • ✅ Pipelines: Pipeline, make_pipeline, ColumnTransformer, FunctionTransformer
  • ✅ Cross-validation: KFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV, validation_curve, learning_curve
  • ✅ Metrics: classification_report, confusion_matrix, ROC/AUC (binary + multiclass OvR), regression metrics (MSE, RMSE, R², MAE)
  • ✅ Model interpretation: permutation_importance, PartialDependenceDisplay, plot_tree, CalibratedClassifierCV
  • ✅ Matplotlib recipes: subplots, scatter, bar, hist, violin, heatmap, contour, errorbar, custom legends
  • ✅ Debugging common errors: ConvergenceWarning, NotFittedError, shape mismatches, scaling-before-split leakage (see the sketch after this list)
  • ✅ Conceptual explanations: precision vs recall, RMSE vs MAE, regularization, kernel trick, bias-variance
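As a concrete instance of the leakage item above, here is a short sklearn sketch of the fix the model is tuned to explain; the dataset and hyperparameters are illustrative only:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Leaky pattern: fitting StandardScaler on ALL rows before splitting lets
# test-set statistics leak into training.
# Correct pattern: split first, then let the Pipeline fit the scaler on
# the training data only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))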

What this model is NOT good at

  • ❌ Anything outside the ML/data-science scope: general chit-chat, history, philosophy, current events. It will confidently confabulate.
  • ❌ Sklearn APIs that don't exist: small models invent plausible-looking API names. Always run the generated code to verify.
  • ❌ Deep learning: almost no PyTorch/TensorFlow training data. Don't ask about transformers, CNNs, or LLMs.
  • ❌ Long-form reasoning: 460M params + a 1024-token context is small. Multi-step proofs, complex algorithm derivations, etc. are unreliable.
  • ❌ Math beyond basic statistics: no symbolic math training.
  • ❌ Code that needs to be exactly correct on the first try: treat outputs as a starting draft, not production code.

Important: 1024-token context window

Pretrained at sequence_len=768, then SFT-extended to 1024. Keep prompts and history under ~900 tokens for reliable behavior. The architecture supports more (RoPE base 100k), but quality drops outside the trained range.
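A small helper for enforcing that budget, assuming the tokenizer from the quick start (the ~900-token threshold is the rule of thumb above, not a hard API limit):

def fits_context(messages, budget=900):
    """True if the templated prompt stays under the ~900-token rule of thumb."""
    ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    return len(ids) <= budget

If fits_context returns False, drop the oldest turns from the history until it passes.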

Recommended generation settings

| Use case | temperature | top_k | num_return_sequences | Notes |
|---|---|---|---|---|
| Best quality (recommended) | 0.6 | 50 | 5 | Self-consistency: +~14pp accuracy, ~2× compute |
| Fast (single-sample) | 0.6 | 50 | 1 | Quick streaming chat; degenerates more often |
| Strict / deterministic | 0.0 | 1 | 1 | Greedy; fully reproducible |
| Brainstorming | 0.8 | 50 | 1 | More creative, more hallucination |
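One caveat on the deterministic row: transformers rejects temperature=0.0 while do_sample=True, so greedy decoding is selected by turning sampling off. A sketch, reusing the quick-start variables:

with torch.inference_mode():
    out = model.generate(
        inputs, attention_mask=attention_mask,
        max_new_tokens=512,
        do_sample=False,                  # greedy: equivalent to the temperature-0 row
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )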

Training summary

  • Pretraining: 500k iterations × 49,152 tokens/iteration ≈ 24.6B tokens, ~21 days on a single 2080 Ti (FP16, 32-step gradient accumulation)
  • Pretrain corpus (17.4M docs after dedup): FineWeb-Edu (general web 65%, technical 20%, code 5%, ML-tagged 0.6%) + Cosmopedia STEM 4.5% + library docs 1.4% (sklearn 40%, matplotlib 25%, plus numpy/pandas/scipy) + ML supplemental (Kaggle notebooks, Starcoder Python ML, ML ArXiv abstracts, StackOverflow Python)
  • SFT: SmolTalk 460k + MMLU 1 epoch + GSM8K 4 epochs + 953 hand-crafted sklearn/matplotlib examples × 45 epochs (~6% of the mixture). 8000 optimizer steps.
  • Optimizer: Muon (matrices) + AdamW (embeddings + scalars)

Architecture (why trust_remote_code=True)

This is a 20-layer GPT with several non-standard features. The HF wrapper code (modeling_nanochat.py) is shipped in this repo and loaded automatically via auto_map.

| Feature | Standard | This model |
|---|---|---|
| Activation | SiLU/SwiGLU | ReLU² (square of ReLU) |
| RMSNorm | Has learnable weight | No learned scale |
| Embedding ↔ LM head | Often tied | Untied |
| RoPE base | 10000 | 100000 |
| Q/K scaling | 1.0 | × 1.2 sharpening |
| Per-layer extras | None | Value embeddings (alternating layers, ResFormer-style) gated by ve_gate |
| Cross-layer | Standard residuals | x0_lambdas blends the initial embedding into every layer |
| Mid-trunk | None | Backout: subtract layer L/2 residual at the logit head |
| Token mixing | None | Smear: cheap bigram via gated previous-token embedding |
| Logits | Linear | Softcap: 15 * tanh(logits/15) |
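For readers who want the math, here is a hedged PyTorch sketch of three of the table's pieces; function names are illustrative, not the repo's actual module names (see modeling_nanochat.py for the real implementation):

import torch
import torch.nn.functional as F

def relu_squared(x):
    # ReLU² activation: square of ReLU
    return F.relu(x) ** 2

def rmsnorm_no_scale(x, eps=1e-6):
    # RMSNorm with no learned weight: pure normalization
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def softcap_logits(logits, cap=15.0):
    # Logit softcap: smoothly bounds logits to (-cap, cap)
    return cap * torch.tanh(logits / cap)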

Spec: n_layer=20, n_embd=1280, n_head=10, n_kv_head=1 (GQA), vocab=32768, seq_len=1024 (SFT-extended from 768 pretrain).
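To verify that the auto classes resolve to the repo's custom code rather than a built-in architecture, a quick check (the attribute names are assumed from the spec line above, not confirmed against config.json):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("scottejin/Datatest-460m", trust_remote_code=True)
print(type(cfg).__name__)        # expected: NanochatConfig, loaded from configuration_nanochat.py
print(cfg.n_layer, cfg.n_embd)   # expected: 20 1280, per the spec line (names assumed)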

Files in this repo

| File | Purpose |
|---|---|
| model.safetensors | Weights (fp16, ~920 MB) |
| config.json | Model configuration |
| configuration_nanochat.py | NanochatConfig class |
| modeling_nanochat.py | NanochatForCausalLM class (the model) |
| tokenization_nanochat.py | NanochatTokenizer (rustbpe wrapper) |
| tokenizer_config.json | HF tokenizer config + Jinja chat template |
| tokenizer.pkl | Pickled tiktoken encoder (must ship; not just tokenizer.json) |
| generation_config.json | Default sampling parameters |

Hardware requirements

  • Inference (fp16, single sample): 1.0 GB VRAM (model) + ~250 MB (KV cache for 1024 tokens) + workspace ≈ **1.5 GB**
  • Inference (fp16, num_return_sequences=5): 1.0 GB (model) + ~1.25 GB (KV cache × 5) + workspace ≈ **2.5 GB**
  • Runs comfortably on any GPU with ≥3 GB VRAM for self-consistency, or ≥2 GB for single-sample
  • CPU inference works (use torch_dtype=torch.float32); expect ~5-15 s per response
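A minimal CPU-only loading sketch per the note above (fp32 instead of the shipped fp16):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "scottejin/Datatest-460m"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True,
    torch_dtype=torch.float32,   # fp32 on CPU; the fp16 weights are upcast at load
).eval()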

Acknowledgements

Built on karpathy/nanochat. Trained with substantial assistance from Claude Code.

License

MIT (model weights and wrapper code).
