Atom 3.4m

Atom is a 3.4M-parameter causal language model developed by Universal Computing Research. It was pretrained from scratch as a compact research model for studying language-model architecture, data curricula, and small-model benchmarking.

Model details

  • Architecture: causal decoder-only language model
  • Parameters: 3,412,800
  • Layers: 7
  • Hidden size: 192
  • Attention: 3 query heads and 1 key-value head (grouped-query attention)
  • Head dimension: 64
  • Feed-forward size: 480
  • Context length: 512 tokens
  • Positional encoding: rotary position embeddings (RoPE)
  • RoPE theta: 5000.0
  • Normalization: RMSNorm
  • Activation: gated SiLU feed-forward network
  • Vocabulary size: 4,096 tokens
  • Tokenizer: custom byte-level BPE, exposed as GPT2TokenizerFast
  • Training tokens: approximately 5 billion
  • License: Apache-2.0
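
The parameter count can be reproduced from the hyperparameters listed above. The following is a minimal sketch, not the authors' accounting: it assumes tied input and output embeddings, no bias terms, three feed-forward matrices (gate, up, down) for the gated SiLU block, two RMSNorm weight vectors per layer plus one final RMSNorm, and no extra parameters from the attention implementation. Under those assumptions the total comes out to 3,412,800, matching the figure in the list.

```python
# Hyperparameters taken from the model details list.
VOCAB, HIDDEN, LAYERS = 4096, 192, 7
Q_HEADS, KV_HEADS, HEAD_DIM, FFN = 3, 1, 64, 480


def count_parameters() -> int:
    """Count weights under the assumptions stated in the lead-in (no biases)."""
    embedding = VOCAB * HIDDEN  # counted once: input/output embeddings assumed tied
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # one shared key-value head
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    attention = q_proj + kv_proj + o_proj
    feed_forward = 3 * HIDDEN * FFN  # gate, up, and down projections (assumed layout)
    norms = 2 * HIDDEN  # assumed: one RMSNorm before attention, one before the FFN
    per_layer = attention + feed_forward + norms
    final_norm = HIDDEN  # assumed: a single RMSNorm after the last layer
    return embedding + LAYERS * per_layer + final_norm


total = count_parameters()
print(f"{total:,}")
```

The embedding table accounts for 786,432 of these weights, roughly 23% of the model, which is why the small 4,096-token vocabulary matters at this scale.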

The model uses tied input and output embeddings. Its custom attention implementation combines grouped-query attention with XSE.
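
To illustrate the grouped-query part of this design, here is a minimal NumPy sketch in which three query heads share a single key-value head, using the head counts and head dimension from the model details. It is not the model's actual attention code: RoPE and XSE are not modeled, and the function name and random inputs are illustrative assumptions.

```python
import numpy as np

N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 3, 1, 64  # values from the model details


def grouped_query_attention(q, k, v):
    """Causal grouped-query attention (illustrative; no RoPE, no XSE).

    q: (n_q_heads, seq, head_dim); k and v: (n_kv_heads, seq, head_dim).
    """
    n_q, seq, dim = q.shape
    group = n_q // k.shape[0]  # query heads served by each key-value head
    k = np.repeat(k, group, axis=0)  # share each key-value head across its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)  # (n_q, seq, seq)
    future = np.triu(np.ones((seq, seq), dtype=bool), 1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)  # causal mask
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (n_q, seq, head_dim)


rng = np.random.default_rng(0)
q = rng.standard_normal((N_Q_HEADS, 8, HEAD_DIM))
k = rng.standard_normal((N_KV_HEADS, 8, HEAD_DIM))
v = rng.standard_normal((N_KV_HEADS, 8, HEAD_DIM))
out = grouped_query_attention(q, k, v)
print(out.shape)
```

With one key-value head instead of three, the key and value projections and any key-value cache are a third of the size they would be under standard multi-head attention with the same query heads.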

Tokenizer

Atom uses a custom byte-level BPE tokenizer trained specifically for this pretraining corpus. The tokenizer has a vocabulary of 4,096 tokens and includes dedicated padding, beginning-of-sequence, end-of-sequence, unknown, and end-of-text tokens.

Training data and curriculum

Atom was trained on a curriculum combining general web text, educational material, synthetic textbook-style content, and mathematical data. The mixture changed gradually during training: general web data was emphasized earlier, while educational, synthetic, and mathematical material received more weight later.
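
A gradually shifting mixture of this kind can be sketched as an interpolation between a starting and an ending set of sampling weights. The schedule shape (linear) and the endpoint weights below are hypothetical and chosen only for illustration; the card does not state the actual schedule or per-stage weights.

```python
def interpolate_mixture(start, end, progress):
    """Linearly blend two sampling mixtures; progress runs from 0.0 to 1.0."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    return {
        name: (1.0 - progress) * start[name] + progress * end[name]
        for name in start
    }


# HYPOTHETICAL endpoint weights for illustration only, not Atom's real values.
START = {"general_web": 0.70, "educational": 0.10, "synthetic": 0.10, "math": 0.10}
END = {"general_web": 0.10, "educational": 0.30, "synthetic": 0.30, "math": 0.30}

for progress in (0.0, 0.5, 1.0):
    weights = interpolate_mixture(START, END, progress)
    print(progress, {name: round(w, 2) for name, w in weights.items()})
```

Because both endpoints sum to 1, every interpolated mixture also sums to 1 and can be used directly as sampling probabilities.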

Approximate proportions over the complete training run were:

| Dataset | Subset / split used | Approximate proportion |
| --- | --- | --- |
| HuggingFaceFW/fineweb-edu | All available CC-MAIN-* configurations under data/, train split | 39% |
| openbmb/Ultra-FineWeb | English v1.4 (ultrafineweb_en_v1_4; en split) | 31% |
| HuggingFaceTB/finemath | finemath-3plus, train split | 12% |
| HuggingFaceTB/smollm-corpus | cosmopedia-v2, train split | 12% |
| openbmb/UltraData-Math | UltraData-Math-L2-preview, train split | 6% |

These percentages describe the approximate aggregate sampling mixture rather than exact document counts. Refer to the individual dataset cards for their source information, licenses, and usage conditions.
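
As a rough guide, the aggregate proportions can be converted into per-dataset token budgets. The sketch below assumes a round total of 5 billion training tokens (the card says "approximately 5 billion"), so the resulting numbers are estimates rather than reported counts.

```python
TOTAL_TOKENS = 5_000_000_000  # assumed round figure for "approximately 5 billion"

# Aggregate sampling proportions from the table, in percent.
MIXTURE_PERCENT = {
    "HuggingFaceFW/fineweb-edu": 39,
    "openbmb/Ultra-FineWeb": 31,
    "HuggingFaceTB/finemath": 12,
    "HuggingFaceTB/smollm-corpus": 12,
    "openbmb/UltraData-Math": 6,
}


def token_budgets(total_tokens, mixture_percent):
    """Split a token total according to integer percentage weights."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total_tokens * pct // 100 for name, pct in mixture_percent.items()}


budgets = token_budgets(TOTAL_TOKENS, MIXTURE_PERCENT)
for name, tokens in budgets.items():
    print(f"{name}: ~{tokens / 1e9:.2f}B tokens")
```

Under that assumption, fineweb-edu contributes roughly 1.95B tokens and the smallest source, UltraData-Math, roughly 0.30B.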

Intended use

This is a small base language model intended for research and benchmarking. It may be useful for experiments involving compact architectures, pretraining curricula, tokenization, evaluation pipelines, and resource-constrained inference.

Atom is a base model and has not been instruction-tuned or aligned for assistant-style interaction.

Evaluation

Atom was evaluated with EleutherAI's lm-evaluation-harness and ArithMark-2.0.

lm-evaluation-harness

| Task | Metric | Score |
| --- | --- | --- |
| ARC-Easy | acc_norm | 33.08% |
| ARC-Challenge | acc_norm | 21.76% |
| HellaSwag | acc_norm | 27.65% |
| PIQA | acc_norm | 55.71% |

ArithMark-2.0

| Benchmark | Metric | Score |
| --- | --- | --- |
| ArithMark-2.0 | acc | 27.36% |

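Average score: 34.54% (apparently averaged over the four lm-evaluation-harness tasks only; ArithMark-2.0 is not included)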
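
The card does not say which scores the average covers, so the following check recomputes it both ways from the rounded values in the tables. The mean of the four lm-evaluation-harness scores is 34.55%, within rounding of the reported figure, while the mean of all five scores is 33.11%. The variable names are illustrative.

```python
from statistics import mean

# Rounded scores copied from the tables, in percent.
LM_EVAL = {"ARC-Easy": 33.08, "ARC-Challenge": 21.76, "HellaSwag": 27.65, "PIQA": 55.71}
ARITHMARK = {"ArithMark-2.0": 27.36}

mean_lm_eval = mean(LM_EVAL.values())
mean_all = mean(list(LM_EVAL.values()) + list(ARITHMARK.values()))

print(f"lm-evaluation-harness only: {mean_lm_eval:.2f}%")
print(f"all five benchmarks:        {mean_all:.2f}%")
```

The 0.01-point gap between 34.55% and the reported 34.54% is what one would expect if the average was taken over unrounded scores.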

Limitations

Atom is a very small model and should not be expected to produce reliable factual, safety-critical, or instruction-following outputs. Its short context window and limited capacity constrain coherence, knowledge recall, reasoning, and long-form generation.

The model may reproduce errors, biases, or undesirable patterns present in its training data. It has not undergone dedicated safety training and should not be used for high-stakes decisions.
