🏆 ARC-AGI-2 Solver: 4× L4 GPU Production Pipeline

Competition: ARC Prize 2026 (ARC-AGI-2)

🚀 Quick Start (Kaggle 4× L4 GPUs)

Step 1: Add Model as Kaggle Dataset

Go to julien31/Soar-qwen-14b and add it as a Kaggle dataset (or use the Kaggle Models integration). Mount it at /kaggle/input/soar-qwen-14b.

Alternative (smaller): Use julien31/Soar-qwen-7b; set USE_14B = False in the script.

Step 2: Install SGLang (in the first notebook cell)

!pip install "sglang[all]>=0.4.7" aiohttp requests --quiet

Step 3: Run the Solver

!python kaggle_notebook.py

That's it! The script handles everything:

  • Launches 2× SGLang servers (14B model, TP=2 each) across 4 GPUs
  • Solves tasks in parallel using SOAR program synthesis
  • Falls back to heuristics for easy tasks
  • Outputs submission.json in Kaggle format (format sketch below)
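
The Kaggle format expects, for each task, one entry per test input with two attempts. A minimal sketch of a writer for that format (the helper name and the predictions structure are illustrative, not taken from kaggle_notebook.py):

```python
import json

def write_submission(predictions, path="submission.json"):
    """Write pass@2 predictions in the ARC Prize Kaggle format.

    predictions maps task_id -> list with one (attempt_1, attempt_2) grid pair
    per test input; each grid is a list of lists of ints in 0-9.
    """
    submission = {
        task_id: [{"attempt_1": a1, "attempt_2": a2} for a1, a2 in attempts]
        for task_id, attempts in predictions.items()
    }
    with open(path, "w") as f:
        json.dump(submission, f)

# Example: one task with a single test input and two candidate answers.
write_submission({"00576224": [([[0, 0], [0, 0]], [[1, 1], [1, 1]])]})
```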

🎯 Architecture

GPU Utilization: 2× Soar-qwen-14b (TP=2)

GPU 0 + GPU 1 → SGLang Server A (14B model, tensor parallel)
GPU 2 + GPU 3 → SGLang Server B (14B model, tensor parallel)
  • 14B scores 42.75% vs 7B's 36.25% on ARC-AGI-1 (SOAR paper Table 1)
  • 2 servers = parallel task solving, no PCIe bottleneck (launch sketch below)
  • Each server handles 120 tasks independently
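
A minimal sketch of how the two servers could be started from Python (the model path matches the mount point from Step 1; the ports, flags, and readiness check are assumptions to verify against the SGLang version you install):

```python
import os
import subprocess

MODEL_PATH = "/kaggle/input/soar-qwen-14b"  # dataset mount point from Step 1

def launch_server(gpu_ids: str, port: int) -> subprocess.Popen:
    """Start one SGLang server pinned to two GPUs with tensor parallelism 2."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpu_ids}
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--tp", "2",
        "--port", str(port),
    ]
    return subprocess.Popen(cmd, env=env)

server_a = launch_server("0,1", 30000)  # Server A on GPU 0 + GPU 1
server_b = launch_server("2,3", 30001)  # Server B on GPU 2 + GPU 3
# Wait for both servers to answer on their ports before sending requests.
```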

Solving Pipeline (per task)

1. Heuristic check (instant) → 12+ pattern matchers
   ↓ (if no match)
2. SOAR Program Synthesis:
   a. Sample 60 programs from Soar-qwen-14b
   b. Execute each → check against training examples
   c. Refine top programs with execution feedback
   d. Weighted majority vote → top-2 answers
3. Submit pass@2 predictions (sketch of steps 2a-2d below)
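
A condensed sketch of steps 2a, 2b, and 2d against one SGLang server's OpenAI-compatible completions endpoint; refinement (2c) is omitted, and the helper names, prompt handling, and sandboxing are simplified, not taken from program_synthesis.py:

```python
import collections
import json
import requests

def sample_programs(prompt: str, n: int = 60, port: int = 30000) -> list:
    """2a: sample n candidate programs from the local SGLang server."""
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",
        json={"model": "default", "prompt": prompt, "n": n,
              "temperature": 0.9, "max_tokens": 1024},
        timeout=600,
    )
    return [choice["text"] for choice in resp.json()["choices"]]

def run_program(code: str, grid):
    """Execute a candidate that defines transform(grid); None on any failure."""
    namespace = {}
    try:
        exec(code, namespace)  # a real pipeline should sandbox and time-limit this
        return namespace["transform"](grid)
    except Exception:
        return None

def train_accuracy(code: str, train_pairs) -> float:
    """2b: fraction of training examples the program reproduces exactly."""
    hits = sum(run_program(code, ex["input"]) == ex["output"] for ex in train_pairs)
    return hits / len(train_pairs)

def top2_answers(programs, train_pairs, test_input):
    """2d: weighted majority vote over test outputs, weighted by training accuracy."""
    votes = collections.Counter()
    for code in programs:
        weight = train_accuracy(code, train_pairs)
        out = run_program(code, test_input)
        if weight > 0 and out is not None:
            votes[json.dumps(out)] += weight
    return [json.loads(grid) for grid, _ in votes.most_common(2)]
```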

Key Innovation: Verified Solutions

Programs that produce correct outputs for ALL training examples have been verified against every example the task provides; assuming the task has a unique underlying transformation, such programs are very likely to generalize to the test input. This gives us high-confidence predictions that direct grid prediction cannot match.
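
Continuing the sketch above, a candidate would be treated as verified only when it reproduces every training output (a hypothetical filter, reusing train_accuracy from the previous block):

```python
def verified_programs(programs, train_pairs):
    """Keep only candidates that reproduce all training outputs exactly."""
    return [code for code in programs if train_accuracy(code, train_pairs) == 1.0]
```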


📊 Expected Performance

| Component | ARC-AGI-2 | ARC-AGI-1 |
|---|---|---|
| Heuristics only | 2.1% | 5.0% |
| SOAR-7B (30 samples) | ~5-10% | ~15-25% |
| SOAR-14B (60 samples + refinement) | ~10-20% | ~30-42% |
| SOAR-14B + full 6K budget | ~20-30% | ~42%+ |
| 2025 competition winner (NVARC) | 24% | – |

Key factors for higher scores:

  • More samples per task (budget 3000-6000 vs our 60)
  • Multiple SOAR self-improvement iterations
  • Ensemble with TTT transduction model
  • Larger model (32B or 72B)

πŸ“ Files

| File | Purpose |
|---|---|
| kaggle_notebook.py | 🎯 Main submission script (run this on Kaggle) |
| run_soar_eval.py | Standalone SOAR evaluation (for benchmarking) |
| arc_data.py | D8 augmentations, color permutations, TTT data |
| program_synthesis.py | SOAR evolutionary search engine |
| ttt_engine.py | Test-Time Training with LoRA + PoE scoring |
| enhanced_heuristics.py | 20+ pattern-matching heuristics (example sketch below) |
| arc_solver.py | Multi-track ensemble solver |
| kaggle_submission.py | Alternative single-file submission |
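
For illustration, one heuristic in the style of enhanced_heuristics.py might detect that every training output is the input tiled by a fixed factor (the function name and exact check are hypothetical, not taken from the file):

```python
import numpy as np

def try_tiling(train_pairs):
    """Return a solver callable if every output is the input tiled by one factor."""
    factors = set()
    for ex in train_pairs:
        inp, out = np.array(ex["input"]), np.array(ex["output"])
        if out.shape[0] % inp.shape[0] or out.shape[1] % inp.shape[1]:
            return None
        r, c = out.shape[0] // inp.shape[0], out.shape[1] // inp.shape[1]
        if not np.array_equal(np.tile(inp, (r, c)), out):
            return None
        factors.add((r, c))
    if len(factors) != 1:
        return None
    r, c = factors.pop()
    return lambda grid: np.tile(np.array(grid), (r, c)).tolist()
```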

🔧 Configuration Options

Edit the top of kaggle_notebook.py:

USE_14B = True           # True = 2×14B (TP=2), False = 4×7B (TP=1)
PROGRAMS_PER_TASK = 60   # More = better accuracy, slower
REFINEMENTS_PER_TASK = 30
TEMPERATURE_SAMPLE = 0.9 # Higher = more diverse programs
TEMPERATURE_REFINE = 0.7 # Lower = more focused fixes

Time budget estimation:

  • 240 tasks × 60 samples × ~2s/sample = ~8 hours with 2 parallel servers
  • Plus refinement + heuristics: ~10 hours total (within the 12h limit; arithmetic sketched below)
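
The arithmetic behind those bullets, as a quick sanity check (the ~2 s/sample figure is the assumption stated above, not a measurement):

```python
tasks, samples_per_task, sec_per_sample = 240, 60, 2.0
generation_hours = tasks * samples_per_task * sec_per_sample / 3600  # 8.0 hours
total_hours = generation_hours + 2.0  # refinement + heuristics overhead
assert total_hours <= 12, "must fit Kaggle's 12-hour runtime limit"
print(f"generation ~{generation_hours:.0f} h, total ~{total_hours:.0f} h")
```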

📚 Literature Foundation

This solver is built on the exact methods from the top ARC Prize 2025 winners:

| Paper | Key Contribution | Score |
|---|---|---|
| SOAR (2507.14172) | Evolutionary program synthesis + self-improvement | 52% ARC-1 |
| Product of Experts (2505.07859) | DFS + multi-augmentation scoring | 71.6% ARC-1 |
| TTT (2411.07279) | Per-task LoRA + augmented inference | 61.9% ARC-1 |
| ARC-AGI-2 (2505.11831) | Benchmark definition | – |
| ARC Prize 2025 Report (2601.10904) | Competition winner analysis | 24% ARC-2 |

SOAR Model Performance (from paper Table 1)

| Model | 1-shot | Sample-6k | Sample&Refine-6k | SOAR-6k |
|---|---|---|---|---|
| Soar-7B | 1.0% | 5.6% | 14.25% | 36.25% |
| Soar-14B | 1.0% | 12.6% | 19.87% | 42.75% |
| Soar-32B | 1.5% | 12.9% | 25.25% | 44.37% |

🔑 Key Insights

  1. ARC-AGI-2 is dramatically harder than ARC-1: o3 drops from 53% to 3%
  2. Program synthesis > direct prediction: verifiable solutions are the key advantage
  3. Refinement is critical: SOAR Sample&Refine outperforms pure sampling by 2-3×
  4. 14B > 7B: 42.75% vs 36.25% with the same compute budget
  5. L4 PCIe limits TP: independent copies (TP=1 or TP=2) beat TP=4 because L4 nodes lack NVLink
