# ARC-AGI-2 Solver: 4× L4 GPU Production Pipeline

Competition: ARC Prize 2026 (ARC-AGI-2)
## Quick Start (Kaggle 4× L4 GPUs)

### Step 1: Add the Model as a Kaggle Dataset
Go to julien31/Soar-qwen-14b and add it as a Kaggle dataset (or use the Kaggle Models integration). Mount it at /kaggle/input/soar-qwen-14b.
Alternative (smaller): use julien31/Soar-qwen-7b and set `USE_14B = False` in the script.
### Step 2: Install SGLang (in the notebook's first cell)

```
!pip install "sglang[all]>=0.4.7" aiohttp requests --quiet
```

### Step 3: Run the Solver

```
!python kaggle_notebook.py
```
That's it! The script handles everything:
- Launches 2× SGLang servers (14B model, TP=2 each) across 4 GPUs
- Solves tasks in parallel using SOAR program synthesis
- Falls back to heuristics for easy tasks
- Outputs `submission.json` in Kaggle format
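For reference, the Kaggle ARC submission layout pairs each task id with two attempted grids per test input (pass@2). A minimal sketch, with a placeholder task id and placeholder grids:

```python
import json

# Sketch of the Kaggle ARC submission layout: one entry per task id, one
# dict per test input, two attempts each (pass@2). Grids are lists of lists
# of color ints 0-9. The task id and grids below are placeholders.
submission = {
    "00576224": [
        {"attempt_1": [[0, 1], [1, 0]], "attempt_2": [[1, 0], [0, 1]]},
    ],
}

with open("submission.json", "w") as f:
    json.dump(submission, f)
```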
## Architecture

### GPU Utilization: 2× Soar-qwen-14b (TP=2)

```
GPU 0 + GPU 1 → SGLang Server A (14B model, tensor parallel)
GPU 2 + GPU 3 → SGLang Server B (14B model, tensor parallel)
```
- 14B scores 42.75% vs 7B's 36.25% on ARC-AGI-1 (SOAR paper Table 1)
- 2 servers = parallel task solving, no PCIe bottleneck
- Each server handles 120 tasks independently
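The two-server layout can be sketched with a small launcher. The flags follow `sglang.launch_server`'s CLI; the model path and ports are assumptions matching the Quick Start above:

```python
import os

MODEL_PATH = "/kaggle/input/soar-qwen-14b"  # assumed dataset mount point

def server_command(gpu_ids, port):
    """Build the environment and sglang.launch_server command for one
    server with tensor parallelism across the given GPUs."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(map(str, gpu_ids)))
    cmd = ["python", "-m", "sglang.launch_server",
           "--model-path", MODEL_PATH,
           "--tp", str(len(gpu_ids)),
           "--port", str(port)]
    return cmd, env

cmd_a, env_a = server_command([0, 1], 30000)  # Server A on GPU 0+1
cmd_b, env_b = server_command([2, 3], 30001)  # Server B on GPU 2+3
# subprocess.Popen(cmd_a, env=env_a) would start Server A (not run here).
```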
### Solving Pipeline (per task)

```
1. Heuristic check (instant): 12+ pattern matchers
      ↓ (if no match)
2. SOAR program synthesis:
   a. Sample 60 programs from Soar-qwen-14b
   b. Execute each and check it against the training examples
   c. Refine the top programs with execution feedback
   d. Weighted majority vote → top-2 answers
3. Submit pass@2 predictions
```
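The sample → execute → vote loop in step 2 can be sketched as follows; the candidate programs are hard-coded toys here, whereas the real pipeline would sample them from the SGLang servers:

```python
from collections import Counter

def run_program(src, grid):
    """Execute a candidate program string defining solve(grid); None on error."""
    try:
        ns = {}
        exec(src, ns)
        return ns["solve"](grid)
    except Exception:
        return None

def train_score(src, train_pairs):
    """Fraction of training examples the program reproduces exactly."""
    return sum(run_program(src, i) == o for i, o in train_pairs) / len(train_pairs)

def top2_answers(programs, train_pairs, test_input):
    """Weighted majority vote: each program votes for its test output,
    weighted by training accuracy; return the two highest-voted grids."""
    votes, grids = Counter(), {}
    for src in programs:
        w = train_score(src, train_pairs)
        out = run_program(src, test_input)
        if out is not None and w > 0:
            key = tuple(map(tuple, out))  # grids are lists, so key by tuples
            votes[key] += w
            grids[key] = out
    return [grids[k] for k, _ in votes.most_common(2)]

# Toy demo: two candidates for a "transpose the grid" task.
train = [([[1, 2]], [[1], [2]])]
progs = ["def solve(g):\n    return [list(r) for r in zip(*g)]",
         "def solve(g):\n    return g"]
answers = top2_answers(progs, train, [[3, 4]])
```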
### Key Innovation: Verified Solutions

Programs that produce correct outputs for ALL training examples are very likely correct on the test input (assuming the task has a unique underlying transformation and the training examples pin it down). This yields high-confidence predictions that direct grid-prediction approaches cannot offer.
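The verification criterion is simple to state in code (a sketch; `rot90` is just an illustrative candidate transformation):

```python
def is_verified(program_fn, train_pairs):
    """A candidate counts as 'verified' only if it reproduces every
    training output exactly; such programs are applied to the test
    input with high confidence."""
    return all(program_fn(inp) == out for inp, out in train_pairs)

# Example: a rotate-90-clockwise hypothesis checked against one training pair.
rot90 = lambda g: [list(r) for r in zip(*g[::-1])]
train = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
print(is_verified(rot90, train))  # True
```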
## Expected Performance
| Component | ARC-AGI-2 | ARC-AGI-1 |
|---|---|---|
| Heuristics only | 2.1% | 5.0% |
| SOAR-7B (30 samples) | ~5-10% | ~15-25% |
| SOAR-14B (60 samples + refinement) | ~10-20% | ~30-42% |
| SOAR-14B + full 6K budget | ~20-30% | ~42%+ |
| 2025 competition winner (NVARC) | 24% | n/a |
Key factors for higher scores:
- More samples per task (budget 3000-6000 vs our 60)
- Multiple SOAR self-improvement iterations
- Ensemble with TTT transduction model
- Larger model (32B or 72B)
## Files

| File | Purpose |
|---|---|
| `kaggle_notebook.py` | Main submission script (run this on Kaggle) |
| `run_soar_eval.py` | Standalone SOAR evaluation (for benchmarking) |
| `arc_data.py` | D8 augmentations, color permutations, TTT data |
| `program_synthesis.py` | SOAR evolutionary search engine |
| `ttt_engine.py` | Test-Time Training with LoRA + PoE scoring |
| `enhanced_heuristics.py` | 20+ pattern-matching heuristics |
| `arc_solver.py` | Multi-track ensemble solver |
| `kaggle_submission.py` | Alternative single-file submission |
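The D8 augmentations listed for `arc_data.py` are the 8 symmetries of a square grid (4 rotations, each optionally mirrored). A self-contained sketch:

```python
def rot90(g):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def d8_augmentations(grid):
    """All 8 dihedral transforms of a grid: identity, the three other
    rotations, and the horizontally flipped version of each."""
    out = []
    g = grid
    for _ in range(4):
        out.append(g)
        out.append([row[::-1] for row in g])  # horizontal flip
        g = rot90(g)
    return out

augs = d8_augmentations([[1, 2], [3, 4]])
```

For a grid with no internal symmetry, all 8 transforms are distinct, which multiplies the training data eightfold.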
## Configuration Options

Edit the top of `kaggle_notebook.py`:

```python
USE_14B = True            # True = 2×14B (TP=2), False = 4×7B (TP=1)
PROGRAMS_PER_TASK = 60    # More = better accuracy, slower
REFINEMENTS_PER_TASK = 30
TEMPERATURE_SAMPLE = 0.9  # Higher = more diverse programs
TEMPERATURE_REFINE = 0.7  # Lower = more focused fixes
```
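How the two temperatures are used can be sketched as request payloads for SGLang's OpenAI-compatible `/v1/completions` endpoint (field names follow the OpenAI completions API shape; the prompt strings are placeholders):

```python
def build_sample_request(prompt, n, temperature):
    """Payload for an OpenAI-compatible /v1/completions endpoint, as
    exposed by SGLang. Field names assumed from the OpenAI API shape."""
    return {
        "model": "default",
        "prompt": prompt,
        "n": n,
        "temperature": temperature,
        "max_tokens": 1024,
    }

# Diverse initial sampling vs. focused refinement with execution feedback.
sample_req = build_sample_request("<task prompt>", n=60, temperature=0.9)
refine_req = build_sample_request("<task prompt + feedback>", n=30, temperature=0.7)
# requests.post("http://localhost:30000/v1/completions", json=sample_req)
# would return the candidate programs (not run here).
```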
Time budget estimate:
- 240 tasks × 60 samples × ~2 s/sample ≈ 28,800 s: ~8 h of generation, ~4 h wall-clock across the 2 parallel servers
- Refinement and heuristics bring the total to roughly 10 h, within the 12 h limit
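The arithmetic behind the estimate, assuming the 120-tasks-per-server split:

```python
tasks, samples, sec_per_sample, servers = 240, 60, 2.0, 2

total_gpu_seconds = tasks * samples * sec_per_sample   # 28,800 s of sampling
per_server_hours = total_gpu_seconds / servers / 3600  # 120 tasks per server,
                                                       # run concurrently
```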
## Literature Foundation

This solver builds directly on methods from the top ARC Prize 2025 entries:
| Paper | Key Contribution | Score |
|---|---|---|
| SOAR (2507.14172) | Evolutionary program synthesis + self-improvement | 52% ARC-1 |
| Product of Experts (2505.07859) | DFS + multi-augmentation scoring | 71.6% ARC-1 |
| TTT (2411.07279) | Per-task LoRA + augmented inference | 61.9% ARC-1 |
| ARC-AGI-2 (2505.11831) | Benchmark definition | n/a |
| ARC Prize 2025 Report (2601.10904) | Competition winner analysis | 24% ARC-2 |
### SOAR Model Performance (from paper Table 1)
| Model | 1-shot | Sample-6k | Sample&Refine-6k | SOAR-6k |
|---|---|---|---|---|
| Soar-7B | 1.0% | 5.6% | 14.25% | 36.25% |
| Soar-14B | 1.0% | 12.6% | 19.87% | 42.75% |
| Soar-32B | 1.5% | 12.9% | 25.25% | 44.37% |
## Key Insights

- ARC-AGI-2 is dramatically harder than ARC-AGI-1: o3 drops from 53% to 3%
- Program synthesis beats direct prediction: verifiable solutions are the key advantage
- Refinement is critical: SOAR Sample&Refine outperforms pure sampling by 2-3×
- 14B > 7B: 42.75% vs 36.25% at the same compute budget
- L4's PCIe-only interconnect limits tensor parallelism: independent copies (TP=1 or TP=2) beat TP=4 because there is no NVLink
## Models & Datasets
- Primary model: julien31/Soar-qwen-14b (14.7B params)
- Fallback model: julien31/Soar-qwen-7b (7.6B params)
- Training data: julien31/soar_arc_train_5M
- Benchmark: arc-agi-community/arc-agi-2