# Datatest-460m
A 460M-parameter language model trained from scratch on a single NVIDIA RTX 2080 Ti to help with scikit-learn, matplotlib, and general ML coding tasks. Trained on ~24.6B tokens of curated code, ML papers, library docs, and Stack Overflow Q&A; instruction-tuned with SmolTalk + ~950 hand-crafted sklearn/matplotlib examples.
⚠️ **Requires `trust_remote_code=True`.** The architecture has several non-standard features (value embeddings, ReLU², custom RMSNorm) that are not in the standard `transformers` model registry. The repo ships its own `modeling_nanochat.py`, which is loaded via the auto classes.
## Quick start
For best results, use self-consistency: generate N=5 candidates per query and pick the first whose first Python block parses cleanly. This gives +~14pp on a custom 18-problem sklearn benchmark vs N=1; the small model degenerates often, but rarely on every parallel sample. Cost: ~2× wall time for ~14pp accuracy.
```python
import ast, re, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "scottejin/Datatest-460m"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto"
).eval()

CODE_BLOCK = re.compile(r"```(?:python)?\s*\n(.*?)\n```", re.DOTALL)

def pick_best(candidates):
    """Return the first candidate whose first python block parses cleanly."""
    for c in candidates:
        m = CODE_BLOCK.search(c)
        if m:
            try:
                ast.parse(m.group(1))
                return c
            except SyntaxError:
                pass
    return candidates[0]

messages = [{"role": "user", "content": "How do I tune SVM hyperparameters with GridSearchCV? Show a complete pipeline."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
attention_mask = inputs.ne(tokenizer.eos_token_id).long()

with torch.inference_mode():
    out = model.generate(
        inputs, attention_mask=attention_mask,
        max_new_tokens=512, do_sample=True,
        temperature=0.6, top_k=50,   # measured optimal with N=5
        num_return_sequences=5,      # self-consistency
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

prompt_len = inputs.shape[1]
candidates = [tokenizer.decode(out[i, prompt_len:], skip_special_tokens=True) for i in range(5)]
print(pick_best(candidates))
```
For a single-sample / streaming version (faster, lower quality), use `num_return_sequences=1` with `do_sample=True` and `temperature=0.6`, as in the sketch below.
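A minimal single-sample variant is sketched here; it reuses `tokenizer`, `model`, and `messages` from the quick-start snippet above and the same sampling settings (nothing in it beyond those names comes from the repo itself):

```python
# Single-sample generation: faster than self-consistency, degenerates more often.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

with torch.inference_mode():
    out = model.generate(
        inputs,
        attention_mask=inputs.ne(tokenizer.eos_token_id).long(),
        max_new_tokens=512,
        do_sample=True, temperature=0.6, top_k=50,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
```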
## What this model is good at
- ✅ sklearn classification recipes: SVMs (linear/RBF/poly), Logistic Regression, Random Forest, Gradient Boosting, KNN, Naive Bayes, Decision Trees
- ✅ Pipelines: `Pipeline`, `make_pipeline`, `ColumnTransformer`, `FunctionTransformer`
- ✅ Cross-validation: `KFold`, `StratifiedKFold`, `GridSearchCV`, `RandomizedSearchCV`, `validation_curve`, `learning_curve`
- ✅ Metrics: `classification_report`, `confusion_matrix`, ROC/AUC (binary + multiclass OvR), regression metrics (MSE, RMSE, R², MAE)
- ✅ Model interpretation: `permutation_importance`, `PartialDependenceDisplay`, `plot_tree`, `CalibratedClassifierCV`
- ✅ Matplotlib recipes: subplots, scatter, bar, hist, violin, heatmap, contour, errorbar, custom legends
- ✅ Debugging common errors: `ConvergenceWarning`, `NotFittedError`, shape mismatches, scaling-after-split leakage
- ✅ Conceptual explanations: precision vs recall, RMSE vs MAE, regularization, kernel trick, bias-variance
## What this model is NOT good at
- ❌ Anything outside the ML/data-science scope (general chit-chat, history, philosophy, current events). It will confidently confabulate.
- ❌ Sklearn APIs that don't exist: small models invent plausible-looking API names. Always run the generated code to verify (see the sketch after this list).
- ❌ Deep learning: almost no PyTorch/TensorFlow training data. Don't ask about transformers, CNNs, or LLMs.
- ❌ Long-form reasoning: 460M params + a 1024-token context is small. Multi-step proofs, complex algorithm derivations, etc. are unreliable.
- ❌ Math beyond basic statistics: no symbolic math training.
- ❌ Code that needs to be exactly correct on the first try: treat outputs as a starting draft, not production code.
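One cheap guard against hallucinated sklearn APIs is to check that everything a generated snippet imports from `sklearn` actually exists before running it. The helper below is only a sketch (the function name and structure are ours, not part of this repo), and it only catches bad `from sklearn... import X` lines, not invented methods or arguments:

```python
import ast
from importlib import import_module

def sklearn_imports_exist(code: str) -> bool:
    """Return False if the snippet imports sklearn names that don't exist."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module and node.module.startswith("sklearn"):
            try:
                mod = import_module(node.module)      # e.g. sklearn.model_selection
            except ImportError:
                return False
            for alias in node.names:
                if alias.name != "*" and not hasattr(mod, alias.name):
                    return False                       # imported name doesn't exist
    return True
```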
## Important: 1024-token context window
Pretrained at `sequence_len=768`, then SFT-extended to 1024. Keep prompts and history under ~900 tokens for reliable behavior. The architecture supports more (RoPE base 100k), but quality drops outside the trained range.
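A quick way to check you are inside that budget, using the tokenizer and chat template from the quick start (the helper name and the 900-token `budget` default are just the heuristic above, not a hard limit in the repo):

```python
def within_budget(messages, tokenizer, budget=900):
    """Count prompt tokens after applying the chat template."""
    ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    return ids.shape[1], ids.shape[1] <= budget

n_tokens, ok = within_budget(messages, tokenizer)
if not ok:
    messages = messages[-2:]   # e.g. drop older turns and re-check
```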
## Recommended generation settings
| Use case | temperature | top_k | num_return_sequences | Notes |
|---|---|---|---|---|
| Best quality (recommended) | 0.6 | 50 | 5 | Self-consistency: +~14pp accuracy, ~2× compute |
| Fast (single-sample) | 0.6 | 50 | 1 | Quick streaming chat; degenerates more often |
| Strict / deterministic | 0.0 | 1 | 1 | Greedy decoding; fully reproducible |
| Brainstorming | 0.8 | 50 | 1 | More creative, more hallucination |
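For reference, one way to translate these rows into `generate()` keyword sets (the preset names are ours, not from `generation_config.json`; `model`, `tokenizer`, and `inputs` are from the quick start). In `transformers`, the "Strict / deterministic" row corresponds to greedy decoding, i.e. `do_sample=False`:

```python
PRESETS = {
    "best":       dict(do_sample=True,  temperature=0.6, top_k=50, num_return_sequences=5),
    "fast":       dict(do_sample=True,  temperature=0.6, top_k=50, num_return_sequences=1),
    "strict":     dict(do_sample=False, num_return_sequences=1),   # greedy; temperature/top_k ignored
    "brainstorm": dict(do_sample=True,  temperature=0.8, top_k=50, num_return_sequences=1),
}

out = model.generate(
    inputs,
    attention_mask=inputs.ne(tokenizer.eos_token_id).long(),
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    **PRESETS["strict"],
)
```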
## Training summary
- Pretraining: 500k iterations × 49,152 tokens = ~24.6B tokens, ~21 days on a single 2080 Ti (FP16, 32-step gradient accumulation)
- Pretrain corpus (17.4M docs after dedup): FineWeb-Edu (general web 65%, technical 20%, code 5%, ML-tagged 0.6%) + Cosmopedia STEM 4.5% + library docs 1.4% (sklearn 40%, matplotlib 25%, numpy/pandas/scipy) + ML supplemental (Kaggle notebooks, Starcoder Python ML, ML ArXiv abstracts, StackOverflow Python)
- SFT: SmolTalk 460k + MMLU 1 epoch + GSM8K 4 epochs + 953 hand-crafted sklearn/matplotlib examples × 45 epochs (~6% of the mixture). 8,000 optimizer steps.
- Optimizer: Muon (matrices) + AdamW (embeddings + scalars)
## Architecture (why `trust_remote_code=True`)
This is a 20-layer GPT with several non-standard features. The HF wrapper code (`modeling_nanochat.py`) is shipped in this repo and loaded automatically via `auto_map`.
| Feature | Standard | This model |
|---|---|---|
| Activation | SiLU/SwiGLU | ReLU² (square of ReLU) |
| RMSNorm | Has learnable weight | No learned scale |
| Embedding ↔ LM head | Often tied | Untied |
| RoPE base | 10000 | 100000 |
| Q/K scaling | 1.0 | × 1.2 sharpening |
| Per-layer extras | None | Value embeddings (alternating layers, ResFormer-style) gated by ve_gate |
| Cross-layer | Standard residuals | x0_lambdas blends initial embedding into every layer |
| Mid-trunk | None | Backout: subtract layer L/2 residual at logit head |
| Token mixing | None | Smear: cheap bigram via gated previous-token embedding |
| Logits | Linear | Softcap: 15 * tanh(logits/15) |
Spec: `n_layer=20`, `n_embd=1280`, `n_head=10`, `n_kv_head=1` (GQA), `vocab=32768`, `seq_len=1024` (SFT-extended from 768 pretrain).
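To make a few of the table rows concrete, here is a rough sketch of the ReLU² activation, the parameter-free RMSNorm, and the logit softcap as described above. This is illustrative only; the real implementations live in `modeling_nanochat.py`, and the `eps` value here is a typical default rather than one taken from the repo:

```python
import torch

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # ReLU² activation: square of ReLU, used in place of SiLU/SwiGLU
    return torch.relu(x) ** 2

def rmsnorm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with no learned scale parameter
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Logit softcap: 15 * tanh(logits / 15) keeps logits bounded
    return cap * torch.tanh(logits / cap)
```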
## Files in this repo
| File | Purpose |
|---|---|
| `model.safetensors` | Weights (fp16, ~920 MB) |
| `config.json` | Model configuration |
| `configuration_nanochat.py` | `NanochatConfig` class |
| `modeling_nanochat.py` | `NanochatForCausalLM` class (the model) |
| `tokenization_nanochat.py` | `NanochatTokenizer` (rustbpe wrapper) |
| `tokenizer_config.json` | HF tokenizer config + Jinja chat template |
| `tokenizer.pkl` | Pickled tiktoken encoder (must ship; not just `tokenizer.json`) |
| `generation_config.json` | Default sampling parameters |
## Hardware requirements
- Inference (fp16, single sample): 1.0 GB VRAM (model) + ~250 MB (KV cache for 1024 tokens) + workspace = **1.5 GB**
- Inference (fp16, `num_return_sequences=5`): 1.0 GB (model) + ~1.25 GB (KV cache × 5) = **2.5 GB**
- Runs comfortably on any GPU with ≥3 GB VRAM for self-consistency, ≥2 GB for single-sample
- CPU inference: works (use `torch_dtype=torch.float32`); expect ~5-15 s per response (see the sketch below)
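A minimal CPU-only load following the note above; this is self-contained apart from the repo id, which is the same one used in the quick start:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "scottejin/Datatest-460m"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.float32   # fp32 for CPU inference
).eval()
```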
## Acknowledgements
Built on karpathy/nanochat. Trained with substantial assistance from Claude Code.
## License
MIT (model weights and wrapper code).