ConeML 348M Beta

ConeML 348M Beta is the second public release in the ConeML research series: a 348M-parameter small language model trained from scratch. It succeeds ConeML 348M Alpha (polish900) and improves on it in the held-out reasoning, code, and arithmetic evaluations reported below; calibration is reported for Beta only, so no Alpha comparison is available for it. It is a research artifact and beta candidate, not a polished general assistant.

Why ConeML Exists

ConeML is an independent research effort exploring how much capability a compact language model can reach through deliberate data and curriculum design rather than scale alone. The clearest capability carried forward from Alpha is held-out transitive relation binding; Beta extends that capability while improving code generation and arithmetic in the same model.

Evaluations vs Alpha

All numbers are held-out probes (fresh entities disjoint from training), measured with the same protocol on both models.

Transitive inference, chat surface, first-choice accuracy (%) at depths 1 / 2 / 3 / 4 / 5:

Suite	Alpha	Beta
older / younger relation	79 / 89 / 88 / 77 / 71	93 / 91 / 93 / 86 / 82
unseen query phrasing	56 / 73 / 59 / 48 / 34	69 / 66 / 67 / 72 / 76
non-name entities (colored cards)	51 / 50 / 41 / 31 / 28	62 / 63 / 37 / 34 / 23

The non-name suite is still weak for both models.

Other capabilities:

Metric	Alpha	Beta
Code strict-exec (held-out functions)	16.7%	45%
Arithmetic, held-out 10-bucket (sympy-checked)	~21%	33%
Aggregate held-out perplexity	9.17	6.24
Calibration ECE (reasoning / code / agentic)	—	0.037 / 0.032 / 0.015
Output format	indentation unstable	clean first-token answers
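
The calibration row reports expected calibration error (ECE). The card does not say which binning scheme was used, so the sketch below assumes the common formulation with 10 equal-width confidence bins. The function name and the sample values are illustrative; this is not the evaluation code behind the numbers above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    # Per-bin accumulators: [count, sum of confidence, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)
    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count:
            ece += (count / total) * abs(n_correct / count - conf_sum / count)
    return ece


# Illustrative values only (not model outputs).
confs = [0.9, 0.9, 0.6, 0.6]
hits = [True, True, True, False]
print(round(expected_calibration_error(confs, hits), 3))
```

A lower value means the stated confidence tracks the observed accuracy more closely; a model that is always right at confidence 1.0 scores 0.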

Standard public benchmarks (zero-shot, chat format) — reported for comparability, and modest as expected at this scale:

Benchmark	Beta
GSM8K (300-item sample, exact-match)	5.0%
HumanEval (pass@1, 164)	0%

These two numbers measure full multi-step and algorithmic problem-solving, which is beyond this model: the GSM8K score reflects the unsolved multi-digit arithmetic noted under Known Limitations, and HumanEval requires complete algorithmic solutions. The 45% code figure above comes from held-out completion of simple function bodies, a different and easier task. Both numbers are published for transparency, not as strengths.

On these evaluations Beta improves over Alpha on held-out reasoning, code execution, arithmetic, perplexity, and output formatting. On the older/younger relation suite Beta is higher at every depth; on unseen query phrasing it is higher at four of five depths (Alpha is higher at depth 2, 73 vs 66). On the non-name suite Beta is higher at depths 1, 2, and 4 and lower at depths 3 and 5. Alpha's internal fixed-template probe saturated at 100% (depths 1–3); Beta's held-out template accuracy is 99 / 97 / 95, effectively equal on a harder probe.
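
The per-depth comparisons in this paragraph can be re-derived from the transitive-inference table. The snippet below is only a convenience sketch: it copies the table values and prints Beta minus Alpha at each depth. The variable and function names are illustrative.

```python
# Accuracy at depths 1-5, copied from the transitive-inference table above.
ALPHA = {
    "older / younger relation": [79, 89, 88, 77, 71],
    "unseen query phrasing": [56, 73, 59, 48, 34],
    "non-name entities (colored cards)": [51, 50, 41, 31, 28],
}
BETA = {
    "older / younger relation": [93, 91, 93, 86, 82],
    "unseen query phrasing": [69, 66, 67, 72, 76],
    "non-name entities (colored cards)": [62, 63, 37, 34, 23],
}


def depth_deltas(suite):
    """Return Beta minus Alpha for one suite, one value per depth."""
    return [b - a for a, b in zip(ALPHA[suite], BETA[suite])]


for suite in ALPHA:
    deltas = depth_deltas(suite)
    # Depths are 1-indexed in the table.
    lower_at = [i + 1 for i, delta in enumerate(deltas) if delta < 0]
    print(f"{suite}: deltas={deltas}, Beta lower at depths {lower_at}")
```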

Intended Format

Prompt the model in the chat format below, using the exact User: / Assistant: markers. Raw completion (without the markers) produces degraded output. The template also ships in chat_template.jinja / tokenizer_config.json, so tokenizer.apply_chat_template(...) works directly.

User:
<instruction>
Assistant:
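
For callers that build the prompt string by hand rather than through tokenizer.apply_chat_template(...), the format can be captured in a small helper. This is a sketch: `build_prompt` is a hypothetical name, and the newline placement (one newline after each marker) is taken from the prompt string in the Loading example. The shipped chat_template.jinja remains the reference.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a single-turn instruction in the User: / Assistant: markers."""
    # Surrounding whitespace is stripped so the markers stay on their own lines.
    return f"User:\n{instruction.strip()}\nAssistant:\n"


prompt = build_prompt("Mia is taller than Ben. Ben is taller than Zoe. Who is tallest?")
print(repr(prompt))
```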

Loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ConeML/coneml-348m-beta"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float32, device_map="auto")

# Wrap the instruction in the exact User: / Assistant: markers.
prompt = "User:\nMia is taller than Ben. Ben is taller than Zoe. Who is tallest? Return only the name.\nAssistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; stop at the end-of-sequence token.
out = model.generate(**inputs, max_new_tokens=16, do_sample=False, eos_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Architecture

Family: Llama-style decoder · Parameters: ~348M · Layers: 30 · Hidden: 1024
Attention heads: 8 · KV heads: 2 · Vocab: 32768 · Context length: 512
Tokenizer: custom 32K
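
With 8 attention heads sharing 2 KV heads, the layout corresponds to grouped-query attention, which keeps the key/value cache small. The sketch below estimates that cache for one sequence at full context. It assumes head_dim = hidden / attention heads, which the card does not state, and float32 cache entries; the names are illustrative.

```python
# Values from the architecture summary above.
LAYERS = 30
HIDDEN = 1024
ATTENTION_HEADS = 8
KV_HEADS = 2
CONTEXT_LENGTH = 512

# Assumption: head_dim = hidden / attention heads (not stated in the card).
HEAD_DIM = HIDDEN // ATTENTION_HEADS


def kv_cache_bytes(tokens, bytes_per_value=4):
    """Bytes needed to cache keys and values for `tokens` positions of one sequence."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * tokens * bytes_per_value


print(f"query heads per KV head: {ATTENTION_HEADS // KV_HEADS}")
print(f"KV cache at full context (float32): {kv_cache_bytes(CONTEXT_LENGTH) / 2**20:.1f} MiB")
```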

Strengths

Scratch-trained 348M model that improves on its predecessor, Alpha, on nearly all of the held-out evaluations reported here.
Held-out transitive binding that generalizes across new names, new relations, and unseen query phrasing: higher than Alpha at every depth on the older/younger suite, and at four of five depths under unseen query phrasing.
Usable Python function-body generation with stable formatting (45% strict execution on the held-out evaluation reported here).
Materially improved held-out arithmetic over Alpha (33% vs ~21%).
Well calibrated on the reasoning, code, and agentic probes (ECE ≤ 0.04), which is uncommon for models this size.

Known Limitations

Multi-digit arithmetic is weak. Held-out 10-bucket arithmetic is 33% overall; reliable 3-digit and multiplication computation is not solved.
Context length is 512 tokens; longer inputs are out of scope for this release.
Transitive binding for non-name entities (e.g., objects) is near chance at greater depths (37 / 34 / 23 at depths 3–5), which suggests binding is still somewhat surface-shaped.
All figures are research results from held-out probes and the standard benchmarks above — not production guarantees.
Research release, not a replacement for larger general assistants.
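
Because of the 512-token limit, it can help to check the token budget before calling generate. The sketch below assumes that prompt tokens and generated tokens share the same 512-token window; `prompt_budget` and `fits_context` are hypothetical helper names, not part of the release.

```python
CONTEXT_LENGTH = 512  # context length stated for this release


def prompt_budget(max_new_tokens, context_length=CONTEXT_LENGTH):
    """Largest prompt, in tokens, that still leaves room for generation."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return budget


def fits_context(n_prompt_tokens, max_new_tokens, context_length=CONTEXT_LENGTH):
    """True if the prompt plus the requested generation fits in the window."""
    return n_prompt_tokens + max_new_tokens <= context_length


print(prompt_budget(16))
print(fits_context(n_prompt_tokens=500, max_new_tokens=16))
```

In practice n_prompt_tokens would be the length of the tokenized prompt, for example inputs["input_ids"].shape[-1] in the Loading example.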

License

Released for non-commercial use under CC BY-NC 4.0. Commercial use is not granted by this release.
