---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
- openbmb/Ultra-FineWeb
- HuggingFaceTB/finemath
- HuggingFaceTB/smollm-corpus
- openbmb/UltraData-Math
language:
- en
library_name: transformers
tags:
- causal-lm
- decoder-only
- grouped-query-attention
- rope
- swiglu
- custom-tokenizer
- curriculum-learning
- xsa
pipeline_tag: text-generation
---

![bg](bg.png)

# Atom 3.4m

Atom is a 3.4M parameter causal language model developed by **Universal Computing Research**. It was pretrained from scratch as a compact research model for studying language-model architecture, data curricula, and small-model benchmarking.

## Model details

- Architecture: causal decoder-only language model
- Parameters: 3,412,800
- Layers: 7
- Hidden size: 192
- Attention: 3 query heads and 1 key-value head (grouped-query attention)
- Head dimension: 64
- Feed-forward size: 480
- Context length: 512 tokens
- Positional encoding: rotary position embeddings (RoPE)
- RoPE theta: 5000.0
- Normalization: RMSNorm
- Activation: gated SiLU feed-forward network (SwiGLU)
- Vocabulary size: 4,096 tokens
- Tokenizer: custom byte-level BPE, exposed as `GPT2TokenizerFast`
- Training tokens: approximately 5 billion
- License: Apache-2.0

The model uses tied input and output embeddings. Its custom attention implementation combines grouped-query attention with XSA.

## Tokenizer

Atom uses a custom byte-level BPE tokenizer trained specifically for this pretraining corpus. The tokenizer has a vocabulary of 4,096 tokens and includes dedicated padding, beginning-of-sequence, end-of-sequence, unknown, and end-of-text tokens.

## Training data and curriculum

Atom was trained on a curriculum combining general web text, educational material, synthetic textbook-style content, and mathematical data. The mixture changed gradually during training: general web data was emphasized earlier, while educational, synthetic, and mathematical material received more weight later.

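The gradual shift in the mixture can be pictured as a schedule over sampling weights. The sketch below is illustrative only: this card does not publish the actual schedule, so the linear interpolation, the `START` and `END` weights, and the function names `mixture_at` and `aggregate_mixture` are all assumptions. The endpoint weights were chosen only so that their average over the run matches the aggregate proportions this card reports for the complete training run.

```python
# Hypothetical start and end sampling weights. The real schedule is not
# published; these values are assumptions picked for illustration.
START = {
    "HuggingFaceFW/fineweb-edu": 0.36,
    "openbmb/Ultra-FineWeb": 0.50,
    "HuggingFaceTB/finemath": 0.06,
    "HuggingFaceTB/smollm-corpus": 0.06,
    "openbmb/UltraData-Math": 0.02,
}
END = {
    "HuggingFaceFW/fineweb-edu": 0.42,
    "openbmb/Ultra-FineWeb": 0.12,
    "HuggingFaceTB/finemath": 0.18,
    "HuggingFaceTB/smollm-corpus": 0.18,
    "openbmb/UltraData-Math": 0.10,
}


def mixture_at(progress: float) -> dict[str, float]:
    """Linearly interpolate sampling weights; progress runs from 0.0 to 1.0."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    return {
        name: (1.0 - progress) * START[name] + progress * END[name]
        for name in START
    }


def aggregate_mixture(steps: int = 1000) -> dict[str, float]:
    """Average the schedule over evenly spaced midpoints of the run."""
    totals = {name: 0.0 for name in START}
    for i in range(steps):
        for name, weight in mixture_at((i + 0.5) / steps).items():
            totals[name] += weight
    return {name: total / steps for name, total in totals.items()}


if __name__ == "__main__":
    for name, share in aggregate_mixture().items():
        print(f"{name}: {share:.2%}")
```

Under these assumed endpoints, general web data (Ultra-FineWeb) falls from half of the samples to about an eighth, while the mathematical and synthetic sources grow. Any other schedule with the same run-long average would be equally consistent with the reported aggregate figures.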
Approximate proportions over the complete training run were:

| Dataset | Subset / split used | Approximate proportion |
|---|---|---:|
| [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | All available `CC-MAIN-*` configurations under `data/`, `train` split | 39% |
| [openbmb/Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) | English v1.4 (`ultrafineweb_en_v1_4`; `en` split) | 31% |
| [HuggingFaceTB/finemath](https://huggingface.co/datasets/HuggingFaceTB/finemath) | `finemath-3plus`, `train` split | 12% |
| [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) | `cosmopedia-v2`, `train` split | 12% |
| [openbmb/UltraData-Math](https://huggingface.co/datasets/openbmb/UltraData-Math) | `UltraData-Math-L2-preview`, `train` split | 6% |

These percentages describe the approximate aggregate sampling mixture rather than exact document counts. Refer to the individual dataset cards for their source information, licenses, and usage conditions.

## Intended use

This is a small base language model intended for research and benchmarking. It may be useful for experiments involving compact architectures, pretraining curricula, tokenization, evaluation pipelines, and resource-constrained inference.

Atom is a base model and has not been instruction-tuned or aligned for assistant-style interaction.

## Evaluation

Atom was evaluated with EleutherAI's `lm-evaluation-harness` and ArithMark-2.0.

### lm-evaluation-harness

| Task | Metric | Score |
|---|---|---:|
| ARC-Easy | `acc_norm` | 33.08% |
| ARC-Challenge | `acc_norm` | 21.76% |
| HellaSwag | `acc_norm` | 27.65% |
| PIQA | `acc_norm` | 55.71% |

### ArithMark-2.0

| Benchmark | Metric | Score |
|---|---|---:|
| ArithMark-2.0 | `acc` | 27.36% |

**Average score: 34.54%**

## Limitations

Atom is a very small model and should not be expected to produce reliable factual, safety-critical, or instruction-following outputs.
Its short context window and limited capacity constrain coherence, knowledge recall, reasoning, and long-form generation. The model may reproduce errors, biases, or undesirable patterns present in its training data. It has not undergone dedicated safety training and should not be used for high-stakes decisions.
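
As a sanity check on the model details, the listed dimensions can be combined into a parameter count. The sketch below is a back-of-the-envelope calculation, not the model's actual code. It assumes bias-free linear projections, a three-matrix gated feed-forward block (gate, up, down), two RMSNorm weight vectors per layer plus one final RMSNorm, tied embeddings counted once, and no extra parameters from the custom attention. The function name `count_parameters` is hypothetical.

```python
# Dimensions copied from the model details list.
VOCAB_SIZE = 4096
HIDDEN = 192
LAYERS = 7
Q_HEADS = 3
KV_HEADS = 1
HEAD_DIM = 64
FFN = 480


def count_parameters() -> int:
    """Count weights under the assumptions stated in the lead-in."""
    # Tied input and output embeddings are counted once.
    embedding = VOCAB_SIZE * HIDDEN

    # Attention projections, assumed to have no biases.
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    attention = q_proj + k_proj + v_proj + o_proj

    # Gated SiLU feed-forward: gate, up, and down projections (assumed bias-free).
    feed_forward = 3 * HIDDEN * FFN

    # Assumed: one RMSNorm before attention and one before the feed-forward block.
    norms = 2 * HIDDEN

    per_layer = attention + feed_forward + norms
    final_norm = HIDDEN  # assumed single RMSNorm before the output projection
    return embedding + LAYERS * per_layer + final_norm


if __name__ == "__main__":
    print(f"{count_parameters():,}")  # prints 3,412,800
```

Under these assumptions the total is 3,412,800, which matches the stated parameter count. The tied embedding table contributes 786,432 of those weights, roughly 23% of the model.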