Datatest-460m

A 460M-parameter language model trained from scratch on a single NVIDIA RTX 2080 Ti to help with scikit-learn, matplotlib, and general ML coding tasks. Trained on ~24.6B tokens of curated code, ML papers, library docs, and Stack Overflow Q&A; instruction-tuned with SmolTalk + ~950 hand-crafted sklearn/matplotlib examples.

⚠️ Requires trust_remote_code=True. The architecture has several non-standard features (value embeddings, ReLU², custom RMSNorm) that are not in the standard transformers model registry. The repo ships its own modeling_nanochat.py that loads via the auto class.

Quick start

For best results, use self-consistency: generate N=5 candidates per query and pick the first whose first Python code block parses cleanly. This is worth roughly +14pp on a custom 18-problem sklearn benchmark vs. N=1; the small model degenerates often, but rarely on every parallel sample. Cost: ~2× wall time for ~14pp accuracy.

import ast, re, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "scottejin/Datatest-460m"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto"
).eval()

CODE_BLOCK = re.compile(r"```(?:python)?\s*\n(.*?)\n```", re.DOTALL)
def pick_best(candidates):
    """Return the first candidate whose first code block parses; else candidate 0."""
    for c in candidates:
        m = CODE_BLOCK.search(c)
        if m:
            try:
                ast.parse(m.group(1))
                return c
            except SyntaxError:
                pass
    return candidates[0]

messages = [{"role": "user", "content": "How do I tune SVM hyperparameters with GridSearchCV? Show a complete pipeline."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
# pad token == eos token for this model, so build the attention mask explicitly
attention_mask = inputs.ne(tokenizer.eos_token_id).long()
with torch.inference_mode():
    out = model.generate(
        inputs, attention_mask=attention_mask,
        max_new_tokens=512, do_sample=True,
        temperature=0.6, top_k=50,        # measured optimal with N=5
        num_return_sequences=5,           # self-consistency
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )
prompt_len = inputs.shape[1]
# strip the shared prompt from each of the 5 returned sequences
candidates = [tokenizer.decode(out[i, prompt_len:], skip_special_tokens=True) for i in range(5)]
print(pick_best(candidates))

For a single-sample / streaming version (faster, but lower quality), use num_return_sequences=1 with do_sample=True and temperature=0.6, as sketched below.
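A minimal streaming sketch using transformers' TextStreamer, reusing tokenizer, model, inputs, and attention_mask from the quick start (the streamer prints tokens as they arrive; it only supports a single returned sequence):

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
with torch.inference_mode():
    model.generate(
        inputs, attention_mask=attention_mask,
        max_new_tokens=512, do_sample=True,
        temperature=0.6, top_k=50,
        num_return_sequences=1,           # streaming requires a single sequence
        streamer=streamer,                # prints tokens as they are generated
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )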

What this model is good at

  • ✅ sklearn classification recipes: SVMs (linear/RBF/poly), Logistic Regression, Random Forest, Gradient Boosting, KNN, Naive Bayes, Decision Trees
  • ✅ Pipelines: Pipeline, make_pipeline, ColumnTransformer, FunctionTransformer
  • ✅ Cross-validation: KFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV, validation_curve, learning_curve
  • ✅ Metrics: classification_report, confusion_matrix, ROC/AUC (binary + multiclass OvR), regression metrics (MSE, RMSE, R², MAE)
  • ✅ Model interpretation: permutation_importance, PartialDependenceDisplay, plot_tree, CalibratedClassifierCV
  • ✅ Matplotlib recipes: subplots, scatter, bar, hist, violin, heatmap, contour, errorbar, custom legends
  • ✅ Debugging common errors: ConvergenceWarning, NotFittedError, shape mismatches, scaling-before-split leakage (see the sketch after this list)
  • ✅ Conceptual explanations: precision vs recall, RMSE vs MAE, regularization, kernel trick, bias-variance
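As a concrete instance of the leakage item above, here is a short sklearn sketch of the fix the model is tuned to explain; the dataset and hyperparameters are illustrative only:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Leaky pattern: fitting StandardScaler on ALL rows before splitting lets
# test-set statistics leak into training.
# Correct pattern: split first, then let the Pipeline fit the scaler on
# the training data only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))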

What this model is NOT good at

  • ❌ Anything outside the ML/data-science scope: general chit-chat, history, philosophy, current events. It will confidently confabulate.
  • ❌ Sklearn APIs that don't exist: small models invent plausible-looking API names. Always run the generated code to verify.
  • ❌ Deep learning: almost no PyTorch/TensorFlow training data. Don't ask about transformers, CNNs, or LLMs.
  • ❌ Long-form reasoning: 460M params + a 1024-token context is small. Multi-step proofs, complex algorithm derivations, etc. are unreliable.
  • ❌ Math beyond basic statistics: no symbolic math training.
  • ❌ Code that needs to be exactly correct on the first try: treat outputs as a starting draft, not production code.

Important: 1024-token context window

Pretrained at sequence_len=768, then SFT-extended to 1024. Keep prompts and history under ~900 tokens for reliable behavior. The architecture supports more (RoPE base 100k), but quality drops outside the trained range.
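A small helper for enforcing that budget, assuming the tokenizer from the quick start (the ~900-token threshold is the rule of thumb above, not a hard API limit):

def fits_context(messages, budget=900):
    """True if the templated prompt stays under the ~900-token rule of thumb."""
    ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    return len(ids) <= budget

If fits_context returns False, drop the oldest turns from the history until it passes.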

Recommended generation settings

| Use case | temperature | top_k | num_return_sequences | Notes |
|---|---|---|---|---|
| Best quality (recommended) | 0.6 | 50 | 5 | Self-consistency: +~14pp accuracy, ~2× compute |
| Fast (single-sample) | 0.6 | 50 | 1 | Quick streaming chat; degenerates more often |
| Strict / deterministic | 0.0 | 1 | 1 | Greedy; fully reproducible |
| Brainstorming | 0.8 | 50 | 1 | More creative, more hallucination |
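One caveat on the deterministic row: transformers rejects temperature=0.0 while do_sample=True, so greedy decoding is selected by turning sampling off. A sketch, reusing the quick-start variables:

with torch.inference_mode():
    out = model.generate(
        inputs, attention_mask=attention_mask,
        max_new_tokens=512,
        do_sample=False,                  # greedy: equivalent to the temperature-0 row
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )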

Training summary

  • Pretraining: 500k iterations × 49,152 tokens/iteration ≈ 24.6B tokens, ~21 days on a single 2080 Ti (FP16, 32-step gradient accumulation)
  • Pretrain corpus (17.4M docs after dedup): FineWeb-Edu (general web 65%, technical 20%, code 5%, ML-tagged 0.6%) + Cosmopedia STEM 4.5% + library docs 1.4% (sklearn 40%, matplotlib 25%, plus numpy/pandas/scipy) + ML supplemental (Kaggle notebooks, Starcoder Python ML, ML ArXiv abstracts, StackOverflow Python)
  • SFT: SmolTalk 460k + MMLU 1 epoch + GSM8K 4 epochs + 953 hand-crafted sklearn/matplotlib examples × 45 epochs (~6% of the mixture). 8000 optimizer steps.
  • Optimizer: Muon (matrices) + AdamW (embeddings + scalars)

Architecture (why trust_remote_code=True)

This is a 20-layer GPT with several non-standard features. The HF wrapper code (modeling_nanochat.py) is shipped in this repo and loaded automatically via auto_map.

| Feature | Standard | This model |
|---|---|---|
| Activation | SiLU/SwiGLU | ReLU² (square of ReLU) |
| RMSNorm | Has learnable weight | No learned scale |
| Embedding ↔ LM head | Often tied | Untied |
| RoPE base | 10000 | 100000 |
| Q/K scaling | 1.0 | × 1.2 sharpening |
| Per-layer extras | None | Value embeddings (alternating layers, ResFormer-style) gated by ve_gate |
| Cross-layer | Standard residuals | x0_lambdas blends the initial embedding into every layer |
| Mid-trunk | None | Backout: subtract layer L/2 residual at the logit head |
| Token mixing | None | Smear: cheap bigram via gated previous-token embedding |
| Logits | Linear | Softcap: 15 * tanh(logits/15) |
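For readers who want the math, here is a hedged PyTorch sketch of three of the table's pieces; function names are illustrative, not the repo's actual module names (see modeling_nanochat.py for the real implementation):

import torch
import torch.nn.functional as F

def relu_squared(x):
    # ReLU² activation: square of ReLU
    return F.relu(x) ** 2

def rmsnorm_no_scale(x, eps=1e-6):
    # RMSNorm with no learned weight: pure normalization
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def softcap_logits(logits, cap=15.0):
    # Logit softcap: smoothly bounds logits to (-cap, cap)
    return cap * torch.tanh(logits / cap)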

Spec: n_layer=20, n_embd=1280, n_head=10, n_kv_head=1 (GQA), vocab=32768, seq_len=1024 (SFT-extended from 768 pretrain).
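To verify that the auto classes resolve to the repo's custom code rather than a built-in architecture, a quick check (the attribute names are assumed from the spec line above, not confirmed against config.json):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("scottejin/Datatest-460m", trust_remote_code=True)
print(type(cfg).__name__)        # expected: NanochatConfig, loaded from configuration_nanochat.py
print(cfg.n_layer, cfg.n_embd)   # expected: 20 1280, per the spec line (names assumed)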

Files in this repo

| File | Purpose |
|---|---|
| model.safetensors | Weights (fp16, ~920 MB) |
| config.json | Model configuration |
| configuration_nanochat.py | NanochatConfig class |
| modeling_nanochat.py | NanochatForCausalLM class (the model) |
| tokenization_nanochat.py | NanochatTokenizer (rustbpe wrapper) |
| tokenizer_config.json | HF tokenizer config + Jinja chat template |
| tokenizer.pkl | Pickled tiktoken encoder (must ship; not just tokenizer.json) |
| generation_config.json | Default sampling parameters |

Hardware requirements

  • Inference (fp16, single sample): 1.0 GB VRAM (model) + ~250 MB (KV cache for 1024 tokens) + workspace ≈ **1.5 GB**
  • Inference (fp16, num_return_sequences=5): 1.0 GB (model) + ~1.25 GB (KV cache × 5) + workspace ≈ **2.5 GB**
  • Runs comfortably on any GPU with ≥3 GB VRAM for self-consistency, or ≥2 GB for single-sample
  • CPU inference works (use torch_dtype=torch.float32); expect ~5-15 s per response
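A minimal CPU-only loading sketch per the note above (fp32 instead of the shipped fp16):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "scottejin/Datatest-460m"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True,
    torch_dtype=torch.float32,   # fp32 on CPU; the fp16 weights are upcast at load
).eval()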

Acknowledgements

Built on karpathy/nanochat. Trained with substantial assistance from Claude Code.

License

MIT (model weights and wrapper code).
