ConeML 348M Beta

ConeML 348M Beta is the second public release in the ConeML research series: a 348M-parameter small language model trained from scratch. It is the successor to ConeML 348M Alpha (polish900) and improves on it across the held-out reasoning, code, and arithmetic evaluations reported below; calibration is reported for Beta only. It is a research artifact and beta candidate, not a polished general assistant.

Why ConeML Exists

ConeML is an independent research effort exploring how much capability a compact language model can reach through deliberate data and curriculum design rather than scale alone. The clearest capability carried forward from Alpha is held-out transitive relation binding; Beta extends that capability while improving code generation and arithmetic in the same model.

Evaluations vs Alpha

All numbers are held-out probes (fresh entities disjoint from training), measured with the same protocol on both models.

Transitive inference, chat surface, first-choice accuracy, depths 1–5:

| Suite | Alpha | Beta |
|---|---|---|
| older / younger relation | 79 / 89 / 88 / 77 / 71 | 93 / 91 / 93 / 86 / 82 |
| unseen query phrasing | 56 / 73 / 59 / 48 / 34 | 69 / 66 / 67 / 72 / 76 |
| non-name entities (colored cards) | 51 / 50 / 41 / 31 / 28 | 62 / 63 / 37 / 34 / 23 (still weak in both models) |
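As an illustration of what a depth-indexed transitive probe could look like, here is a minimal sketch that builds a chained prompt in the model's chat format. It is not the generator used for the numbers above. The function name `make_chain_prompt`, the name pool, and the reading of "depth d" as d chained facts over d + 1 entities are all assumptions made for this example.

```python
import random

def make_chain_prompt(names, depth, rng, relation="older than", question="Who is oldest?"):
    """Build a chat-format transitive probe with `depth` chained facts.

    Assumption: "depth d" means d facts linking d + 1 entities; the
    model card does not define depth precisely.
    """
    # Pick depth + 1 distinct entities, ordered from oldest to youngest.
    entities = rng.sample(names, depth + 1)
    # One fact per adjacent pair in the chain.
    facts = [f"{a} is {relation} {b}." for a, b in zip(entities, entities[1:])]
    # Shuffle so the answer cannot be read off from the order of the facts.
    rng.shuffle(facts)
    instruction = " ".join(facts) + f" {question} Return only the name."
    prompt = f"User:\n{instruction}\nAssistant:\n"
    return prompt, entities[0]

rng = random.Random(0)
prompt, answer = make_chain_prompt(["Mia", "Ben", "Zoe", "Ava", "Leo", "Kai"], depth=3, rng=rng)
print(prompt)
print("expected:", answer)
```

A held-out version would draw `names` from a pool disjoint from anything seen in training, which this sketch leaves to the caller.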

Other capabilities:

| Metric | Alpha | Beta |
|---|---|---|
| Code strict-exec (held-out functions) | 16.7% | 45% |
| Arithmetic, held-out 10-bucket (sympy-checked) | ~21% | 33% |
| Aggregate held-out perplexity | 9.17 | 6.24 |
| Calibration ECE (reasoning / code / agentic) | n/a | 0.037 / 0.032 / 0.015 |
| Output format | indentation unstable | clean first-token answers |
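For readers unfamiliar with the calibration metric, the following is a minimal sketch of expected calibration error (ECE) using the common equal-width binning definition. The model card does not state the number of bins or how confidences were extracted, so `n_bins=10` and the input format are assumptions for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: floats in [0, 1], the model's confidence in each answer.
    correct:     booleans, whether each answer was right.
    n_bins:      assumed value; the bin count behind the reported ECE is unknown.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        # Each bin contributes in proportion to its share of the samples.
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

Lower is better: a model that answers with 90% confidence and is right 90% of the time contributes zero error for that bin.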

Standard public benchmarks (zero-shot, chat format) — reported for comparability, and modest as expected at this scale:

| Benchmark | Beta |
|---|---|
| GSM8K (300-item sample, exact-match) | 5.0% |
| HumanEval (pass@1, 164) | 0% |

These two numbers measure full multi-step / algorithmic problem-solving, which is beyond a 348M model: GSM8K reflects the unsolved multi-digit arithmetic, and HumanEval requires complete algorithmic solutions (the 45% code figure above is held-out simple function-body completion — a different and easier task). They are published for transparency, not as strengths.
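Because the GSM8K figure comes from a 300-item sample, it carries sampling uncertainty. The sketch below computes a 95% Wilson score interval, assuming 5.0% of 300 corresponds to 15 correct answers. The interval is an addition for illustration and is not reported by the authors.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Assumption: 5.0% exact-match on 300 items means 15 correct.
low, high = wilson_interval(15, 300)
print(f"GSM8K 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 3% to 8%, so small differences on this sample should not be over-read.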

On these evaluations, Beta improves over Alpha on held-out reasoning, code execution, arithmetic, perplexity, and output formatting. On the older/younger relation suite it is higher at every depth; on unseen query phrasing it is higher at four of the five depths (Alpha is higher at depth 2, 73 vs 66). On the non-name entity suite the picture is mixed: Beta is higher at depths 1, 2, and 4 and lower at depths 3 and 5. Alpha's internal fixed-template probe saturated at 100% (depths 1–3); Beta's held-out template accuracy is 99 / 97 / 95, effectively equal on a harder probe.

Intended Format

Prompt the model in the chat format below, using the exact User: / Assistant: markers. Raw completion (without the markers) produces degraded output. The template also ships in chat_template.jinja / tokenizer_config.json, so tokenizer.apply_chat_template(...) works directly.

User:
<instruction>
Assistant:
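When building prompts by hand, a small helper keeps the markers and newlines consistent. This is a minimal sketch; the function name `build_prompt` is hypothetical, and the newline placement is assumed from the Loading example in this card.

```python
def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the User: / Assistant: chat format.

    The layout (marker, newline, instruction, newline, marker, newline)
    follows the prompt string used in the Loading example.
    """
    return f"User:\n{instruction.strip()}\nAssistant:\n"

print(build_prompt("Mia is taller than Ben. Ben is taller than Zoe. Who is tallest? Return only the name."))
```

If the chat template ships with the tokenizer as described, `tokenizer.apply_chat_template(...)` is the less error-prone route; the helper is mainly useful for quick experiments.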

Loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ConeML/coneml-348m-beta"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float32, device_map="auto")

# Chat format: exact "User:" / "Assistant:" markers, each followed by a newline.
prompt = "User:\nMia is taller than Ben. Ben is taller than Zoe. Who is tallest? Return only the name.\nAssistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; a one-name answer needs only a few new tokens.
out = model.generate(**inputs, max_new_tokens=16, do_sample=False, eos_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Architecture

  • Family: Llama-style decoder · Parameters: ~348M · Layers: 30 · Hidden: 1024
  • Attention heads: 8 · KV heads: 2 · Vocab: 32768 · Context length: 512
  • Tokenizer: custom 32K
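The figures above imply a small memory footprint. The sketch below works through the arithmetic for the per-head dimension, the KV cache at full context, and the weight size in float32. It assumes the usual Llama convention that head dimension equals hidden size divided by attention heads, a float32 cache, and batch size 1; none of these is stated explicitly in this card.

```python
# Values taken from the architecture list above.
LAYERS = 30
HIDDEN = 1024
ATTENTION_HEADS = 8
KV_HEADS = 2
CONTEXT_LENGTH = 512
APPROX_PARAMS = 348_000_000  # "~348M", so the weight size is approximate

BYTES_PER_FLOAT32 = 4  # assumption: weights and cache both kept in float32

# Assumption: head_dim = hidden / heads, as in standard Llama-style configs.
head_dim = HIDDEN // ATTENTION_HEADS

# Grouped-query attention caches keys and values only for the KV heads.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * head_dim * BYTES_PER_FLOAT32
kv_cache_bytes = kv_bytes_per_token * CONTEXT_LENGTH

weight_bytes = APPROX_PARAMS * BYTES_PER_FLOAT32

print(f"head_dim: {head_dim}")
print(f"KV cache at {CONTEXT_LENGTH} tokens: {kv_cache_bytes / 2**20:.1f} MiB")
print(f"weights in float32: {weight_bytes / 1e9:.2f} GB")
```

With 2 KV heads against 8 attention heads, the cache is a quarter of what full multi-head attention would need at the same width.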

Strengths

  • Scratch-trained 348M model that improves on Alpha across the held-out evaluations reported here.
  • Held-out transitive binding that generalizes across new names, new relations, and unseen query phrasing: higher than Alpha at every depth on the older/younger suite, and at most depths under unseen query phrasing.
  • Usable Python function-body generation with stable formatting (45% strict execution on the held-out evaluation reported here).
  • Materially improved held-out arithmetic over Alpha (33% vs ~21%).
  • Well-calibrated on reasoning/code/agentic (ECE ≤ 0.04) — uncommon for models this size.

Known Limitations

  • Multi-digit arithmetic is weak. Held-out 10-bucket arithmetic is 33% overall; reliable 3-digit and multiplication computation is not solved.
  • Context length is 512 tokens; longer inputs are out of scope for this release.
  • Transitive binding for non-name entities (e.g., colored cards) is near chance at greater depths (Beta scores 37 / 34 / 23 at depths 3–5); binding is still somewhat surface-shaped.
  • All figures are research results from held-out probes and the standard benchmarks above — not production guarantees.
  • Research release, not a replacement for larger general assistants.
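Given the 512-token limit, it can help to check a request before calling `generate`. The sketch below assumes the limit covers prompt tokens and newly generated tokens together, which is the usual convention but is not spelled out in this card; the function name `fits_context` is hypothetical.

```python
CONTEXT_LENGTH = 512  # from the Architecture section

def fits_context(n_prompt_tokens: int, max_new_tokens: int,
                 context_length: int = CONTEXT_LENGTH) -> bool:
    """Return True if prompt plus generation stays within the context window.

    Assumption: the window is shared by the prompt and the generated tokens.
    """
    return n_prompt_tokens + max_new_tokens <= context_length

# Example: a 496-token prompt leaves room for exactly 16 new tokens.
print(fits_context(496, 16))
print(fits_context(497, 16))
```

In practice `n_prompt_tokens` would be the length of the tokenized prompt, for example `inputs["input_ids"].shape[-1]` in the Loading example.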

License

Released for non-commercial use under CC BY-NC 4.0. Commercial use is not granted by this release.
