# ARC-AGI-2 Solver: 4× L4 GPU Production Pipeline

Competition: ARC Prize 2026 (ARC-AGI-2)
## Quick Start (Kaggle 4× L4 GPUs)

### Step 1: Add the Model as a Kaggle Dataset
Go to julien31/Soar-qwen-14b and add it as a Kaggle dataset (or use the Kaggle Models integration). Mount it at /kaggle/input/soar-qwen-14b.
Alternative (smaller): use julien31/Soar-qwen-7b and set `USE_14B = False` in the script.
### Step 2: Install SGLang (in the notebook's first cell)

```
!pip install "sglang[all]>=0.4.7" aiohttp requests --quiet
```

### Step 3: Run the Solver

```
!python kaggle_notebook.py
```
That's it! The script handles everything:
- Launches 2× SGLang servers (14B model, TP=2 each) across 4 GPUs
- Solves tasks in parallel using SOAR program synthesis
- Falls back to heuristics for easy tasks
- Outputs `submission.json` in Kaggle format
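For reference, the Kaggle ARC submission layout pairs each task id with two attempted grids per test input (pass@2). A minimal sketch, with a placeholder task id and placeholder grids:

```python
import json

# Sketch of the Kaggle ARC submission layout: one entry per task id, one
# dict per test input, two attempts each (pass@2). Grids are lists of lists
# of color ints 0-9. The task id and grids below are placeholders.
submission = {
    "00576224": [
        {"attempt_1": [[0, 1], [1, 0]], "attempt_2": [[1, 0], [0, 1]]},
    ],
}

with open("submission.json", "w") as f:
    json.dump(submission, f)
```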
## Architecture

### GPU Utilization: 2× Soar-qwen-14b (TP=2)

```
GPU 0 + GPU 1 → SGLang Server A (14B model, tensor parallel)
GPU 2 + GPU 3 → SGLang Server B (14B model, tensor parallel)
```
- 14B scores 42.75% vs 7B's 36.25% on ARC-AGI-1 (SOAR paper Table 1)
- 2 servers = parallel task solving, no PCIe bottleneck
- Each server handles 120 tasks independently
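The two-server layout can be sketched with a small launcher. The flags follow `sglang.launch_server`'s CLI; the model path and ports are assumptions matching the Quick Start above:

```python
import os

MODEL_PATH = "/kaggle/input/soar-qwen-14b"  # assumed dataset mount point

def server_command(gpu_ids, port):
    """Build the environment and sglang.launch_server command for one
    server with tensor parallelism across the given GPUs."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(map(str, gpu_ids)))
    cmd = ["python", "-m", "sglang.launch_server",
           "--model-path", MODEL_PATH,
           "--tp", str(len(gpu_ids)),
           "--port", str(port)]
    return cmd, env

cmd_a, env_a = server_command([0, 1], 30000)  # Server A on GPU 0+1
cmd_b, env_b = server_command([2, 3], 30001)  # Server B on GPU 2+3
# subprocess.Popen(cmd_a, env=env_a) would start Server A (not run here).
```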
### Solving Pipeline (per task)

```
1. Heuristic check (instant): 12+ pattern matchers
      ↓ (if no match)
2. SOAR program synthesis:
   a. Sample 60 programs from Soar-qwen-14b
   b. Execute each and check it against the training examples
   c. Refine the top programs with execution feedback
   d. Weighted majority vote → top-2 answers
3. Submit pass@2 predictions
```
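The sample → execute → vote loop in step 2 can be sketched as follows; the candidate programs are hard-coded toys here, whereas the real pipeline would sample them from the SGLang servers:

```python
from collections import Counter

def run_program(src, grid):
    """Execute a candidate program string defining solve(grid); None on error."""
    try:
        ns = {}
        exec(src, ns)
        return ns["solve"](grid)
    except Exception:
        return None

def train_score(src, train_pairs):
    """Fraction of training examples the program reproduces exactly."""
    return sum(run_program(src, i) == o for i, o in train_pairs) / len(train_pairs)

def top2_answers(programs, train_pairs, test_input):
    """Weighted majority vote: each program votes for its test output,
    weighted by training accuracy; return the two highest-voted grids."""
    votes, grids = Counter(), {}
    for src in programs:
        w = train_score(src, train_pairs)
        out = run_program(src, test_input)
        if out is not None and w > 0:
            key = tuple(map(tuple, out))  # grids are lists, so key by tuples
            votes[key] += w
            grids[key] = out
    return [grids[k] for k, _ in votes.most_common(2)]

# Toy demo: two candidates for a "transpose the grid" task.
train = [([[1, 2]], [[1], [2]])]
progs = ["def solve(g):\n    return [list(r) for r in zip(*g)]",
         "def solve(g):\n    return g"]
answers = top2_answers(progs, train, [[3, 4]])
```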
### Key Innovation: Verified Solutions

Programs that produce correct outputs for ALL training examples are very likely correct on the test input (assuming the task has a unique underlying transformation and the training examples pin it down). This yields high-confidence predictions that direct grid-prediction approaches cannot offer.
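The verification criterion is simple to state in code (a sketch; `rot90` is just an illustrative candidate transformation):

```python
def is_verified(program_fn, train_pairs):
    """A candidate counts as 'verified' only if it reproduces every
    training output exactly; such programs are applied to the test
    input with high confidence."""
    return all(program_fn(inp) == out for inp, out in train_pairs)

# Example: a rotate-90-clockwise hypothesis checked against one training pair.
rot90 = lambda g: [list(r) for r in zip(*g[::-1])]
train = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
print(is_verified(rot90, train))  # True
```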
## Expected Performance
| Component | ARC-AGI-2 | ARC-AGI-1 |
|---|---|---|
| Heuristics only | 2.1% | 5.0% |
| SOAR-7B (30 samples) | ~5-10% | ~15-25% |
| SOAR-14B (60 samples + refinement) | ~10-20% | ~30-42% |
| SOAR-14B + full 6K budget | ~20-30% | ~42%+ |
| 2025 competition winner (NVARC) | 24% | n/a |
Key factors for higher scores:
- More samples per task (budget 3000-6000 vs our 60)
- Multiple SOAR self-improvement iterations
- Ensemble with TTT transduction model
- Larger model (32B or 72B)
## Files

| File | Purpose |
|---|---|
| `kaggle_notebook.py` | Main submission script (run this on Kaggle) |
| `run_soar_eval.py` | Standalone SOAR evaluation (for benchmarking) |
| `arc_data.py` | D8 augmentations, color permutations, TTT data |
| `program_synthesis.py` | SOAR evolutionary search engine |
| `ttt_engine.py` | Test-Time Training with LoRA + PoE scoring |
| `enhanced_heuristics.py` | 20+ pattern-matching heuristics |
| `arc_solver.py` | Multi-track ensemble solver |
| `kaggle_submission.py` | Alternative single-file submission |
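The D8 augmentations listed for `arc_data.py` are the 8 symmetries of a square grid (4 rotations, each optionally mirrored). A self-contained sketch:

```python
def rot90(g):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def d8_augmentations(grid):
    """All 8 dihedral transforms of a grid: identity, the three other
    rotations, and the horizontally flipped version of each."""
    out = []
    g = grid
    for _ in range(4):
        out.append(g)
        out.append([row[::-1] for row in g])  # horizontal flip
        g = rot90(g)
    return out

augs = d8_augmentations([[1, 2], [3, 4]])
```

For a grid with no internal symmetry, all 8 transforms are distinct, which multiplies the training data eightfold.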
## Configuration Options

Edit the top of `kaggle_notebook.py`:

```python
USE_14B = True            # True = 2×14B (TP=2), False = 4×7B (TP=1)
PROGRAMS_PER_TASK = 60    # More = better accuracy, slower
REFINEMENTS_PER_TASK = 30
TEMPERATURE_SAMPLE = 0.9  # Higher = more diverse programs
TEMPERATURE_REFINE = 0.7  # Lower = more focused fixes
```
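How the two temperatures are used can be sketched as request payloads for SGLang's OpenAI-compatible `/v1/completions` endpoint (field names follow the OpenAI completions API shape; the prompt strings are placeholders):

```python
def build_sample_request(prompt, n, temperature):
    """Payload for an OpenAI-compatible /v1/completions endpoint, as
    exposed by SGLang. Field names assumed from the OpenAI API shape."""
    return {
        "model": "default",
        "prompt": prompt,
        "n": n,
        "temperature": temperature,
        "max_tokens": 1024,
    }

# Diverse initial sampling vs. focused refinement with execution feedback.
sample_req = build_sample_request("<task prompt>", n=60, temperature=0.9)
refine_req = build_sample_request("<task prompt + feedback>", n=30, temperature=0.7)
# requests.post("http://localhost:30000/v1/completions", json=sample_req)
# would return the candidate programs (not run here).
```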
Time budget estimate:
- 240 tasks × 60 samples × ~2 s/sample ≈ 28,800 s: ~8 h of generation, ~4 h wall-clock across the 2 parallel servers
- Refinement and heuristics bring the total to roughly 10 h, within the 12 h limit
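The arithmetic behind the estimate, assuming the 120-tasks-per-server split:

```python
tasks, samples, sec_per_sample, servers = 240, 60, 2.0, 2

total_gpu_seconds = tasks * samples * sec_per_sample   # 28,800 s of sampling
per_server_hours = total_gpu_seconds / servers / 3600  # 120 tasks per server,
                                                       # run concurrently
```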
## Literature Foundation

This solver builds directly on methods from the top ARC Prize 2025 entries:
| Paper | Key Contribution | Score |
|---|---|---|
| SOAR (2507.14172) | Evolutionary program synthesis + self-improvement | 52% ARC-1 |
| Product of Experts (2505.07859) | DFS + multi-augmentation scoring | 71.6% ARC-1 |
| TTT (2411.07279) | Per-task LoRA + augmented inference | 61.9% ARC-1 |
| ARC-AGI-2 (2505.11831) | Benchmark definition | n/a |
| ARC Prize 2025 Report (2601.10904) | Competition winner analysis | 24% ARC-2 |
### SOAR Model Performance (from paper Table 1)
| Model | 1-shot | Sample-6k | Sample&Refine-6k | SOAR-6k |
|---|---|---|---|---|
| Soar-7B | 1.0% | 5.6% | 14.25% | 36.25% |
| Soar-14B | 1.0% | 12.6% | 19.87% | 42.75% |
| Soar-32B | 1.5% | 12.9% | 25.25% | 44.37% |
## Key Insights

- ARC-AGI-2 is dramatically harder than ARC-AGI-1: o3 drops from 53% to 3%
- Program synthesis beats direct prediction: verifiable solutions are the key advantage
- Refinement is critical: SOAR Sample&Refine outperforms pure sampling by 2-3×
- 14B > 7B: 42.75% vs 36.25% at the same compute budget
- L4's PCIe-only interconnect limits tensor parallelism: independent copies (TP=1 or TP=2) beat TP=4 because there is no NVLink
## Models & Datasets
- Primary model: julien31/Soar-qwen-14b (14.7B params)
- Fallback model: julien31/Soar-qwen-7b (7.6B params)
- Training data: julien31/soar_arc_train_5M
- Benchmark: arc-agi-community/arc-agi-2