PhysicsIntern: from an Autonomous Benchmark-runner to a Research Sidekick

Community Article
Published June 11, 2026


A few weeks ago we released physics-intern, an autonomous agent for physics research. You gave it a problem in plain language (like "derive the Hawking temperature from the Euclidean path integral") and it ran the whole thing on its own: it analyzed the question, decomposed the problem into pieces, dispatched derivations to specialised sub-agents, wrote and ran verification code, critiqued its own results, and handed back a finished answer.

Nine roles, each with its own instructions, were orchestrated into a fixed pipeline that ran in one go, with no human in the loop.


That rigid design was deliberate, and it was there for a good reason: we built it to be measured. We wanted hard evidence that the structure we were betting on (divide the research problem into pieces, work each piece in a fresh context, cross-check, criticize, and so on) actually buys you something on difficult physics.

The way you get that evidence is to run on a benchmark like CritPt, and obviously such a benchmark cannot have a human in the loop. So our framework had to be fully autonomous. Autonomy was never the goal; it was the price of the experiment.

And indeed the experiment worked! The structured, multi-agent approach measurably outperformed the single-shot baselines, for instance bringing Kimi K2.6 from 8.0% to 21.4% on the CritPt benchmark, and lifting Gemini 3.1 Pro from 17.7% to 31.4%, higher than any model reaches on its own.


But as a physicist, this is not how I want to do research!

Nobody with a real problem in front of them wants to type it into an oracle and wait for a verdict that they then have to reverse-engineer and trust. Researchers don't want an autopilot; they'd rather have a collaborator they can steer, one that does the work, shows its reasoning, and checks in at the moments that matter.

So we rebuilt PhysicsIntern as that. The new version is lighter, it keeps you in the loop by default, and instead of being its own bespoke agent, it is mostly a set of skills, so it rides on top of the coding harness many already use: Claude Code, Codex, or Pi.

You can get it on GitHub. This post is about what changed, and what it actually feels like to do research with it. Let's look at a real run.

A real problem, worked with the intern

A few years ago, I was working in the video game industry, and we had an interesting collaboration with students from Ecole Polytechnique about simulating heat transfer in objects using a meshless Monte Carlo method.

Figure: computing heat transfer in a crafted meshless object using Walk-on-Spheres, a Monte Carlo method.

We published a fun paper about it: Heat Simulation on Meshless Crafted-Made Shapes. After writing the paper, I had an idea for a follow-up: extending the method to a different physics problem that is useful in games, namely computing the resonant frequencies of objects, so we would know what they sound like when you hit them.

The question was much harder than heat transfer: you go from solving a simple differential equation to solving an eigenvalue problem. I thought about it for a few days after we published the paper in 2023, then gave up. But after finishing physics-intern, I was reminded of that problem, so I decided to give it a shot!

Here is the question I actually put to PhysicsIntern, in problem.md:

The Walk-on-Spheres algorithm is a Monte Carlo method for solving PDEs like the heat equation. Can we devise a similar algorithm for the Helmholtz equation, that is, for finding the eigenvalues of the Laplacian?

(My real problem was about the resonant frequencies of an object, but these are set by the eigenvalues of the Laplacian on its shape.)

This is not like a benchmark question with a known answer. It's more of an open-ended research question, the kind where the hardest part is often figuring out whether the thing you're asking for can even exist.

The run unfolded like this: first I created a folder and populated it with a bunch of research papers and notes. Then I launched Codex in the workspace and asked it to read the problem and the notes I had provided, then run a survey (which triggered the corresponding /survey skill).

The intern read around the problem, downloaded and summarized a couple of papers, and came back (among other things) with Muller's original 1956 Walk-on-Spheres method, the recent walk-on-stars extensions, neural variants, and the handful of papers that have tried to push WoS toward Helmholtz.

Then I asked for /research-plan. The intern drafted a plan and, crucially, stopped to wait for my approval. The plan is the moment where a researcher's judgment is worth the most and the agent's is worth the least, so that's where the intern hands the wheel back to you. I read it, agreed, and typed "ok please proceed with first step of the plan." Only then did it dispatch the first derivation.

Then came the interesting part. As the derivations came in, the intern reached an uncomfortable conclusion: the problem as I'd posed it has no solution. (Technical reason: a literal fixed-frequency Walk-on-Spheres scheme for the Helmholtz equation cannot extract Dirichlet eigenpairs at all, because the boundary payoff is identically zero. The Walk-on-Spheres algorithm only ever carries information inward from the boundary, but an eigenvalue problem has no boundary data to carry, so the estimator just collapses to zero. There is no condition that makes it land on an eigenvalue.) My naive target doesn't exist! In retrospect I should have understood that earlier, but physics-intern helped me realize it.
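The collapse can be written out in a few lines. This is a formal sketch in three dimensions, assuming the standard spherical mean-value identity for the Helmholtz equation; the symbols k, r_j, g and X_τ are introduced here for illustration and do not come from the run itself.

```latex
% Helmholtz equation at a fixed wavenumber k, with boundary data g:
\Delta u + k^2 u = 0 \ \text{in } \Omega, \qquad u = g \ \text{on } \partial\Omega .

% Mean-value identity on a ball B(x, r) \subset \Omega (3-D, with k r < \pi):
u(x) = \frac{k r}{\sin(k r)} \cdot \frac{1}{4 \pi r^2} \int_{\partial B(x, r)} u \, dS .

% Walk-on-Spheres estimator: chain the identity along the walk
% x_0 \to x_1 \to \dots \to X_\tau \in \partial\Omega, with sphere radii r_j:
u(x_0) = \mathbb{E}\!\left[ \, \prod_{j} \frac{k r_j}{\sin(k r_j)} \; g(X_\tau) \right] .

% Dirichlet eigenproblem: g \equiv 0, so the right-hand side vanishes for every k.
```

Whatever value of k is chosen, the weights multiply a payoff that is zero, so the estimator cannot distinguish an eigenvalue from any other frequency.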

A benchmark-runner optimising for a score would have produced something and moved on. Instead the intern did what a good collaborator does: it told me the question was wrong, and proposed a better one. It reframed the task as resolvent block inverse iteration: apply the inverse Laplacian K = (−Δ)⁻¹ repeatedly using zero-boundary Poisson solves that are a natural fit for WoS, then extract eigenvalues from the iteration.
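To make "zero-boundary Poisson solve with WoS" concrete, here is a minimal one-dimensional sketch. It is not the code from the run: the function name `wos_poisson_1d` and its parameters are hypothetical, and it assumes the unit interval with u(0) = u(1) = 0. In 1-D the "sphere" around x is just the largest interval centred on x that fits in the domain.

```python
import numpy as np


def wos_poisson_1d(x0, f, n_walks=20_000, eps=1e-12, seed=0):
    """Monte Carlo estimate of u(x0) for -u'' = f on (0, 1), u(0) = u(1) = 0."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_walks):
        x, acc = x0, 0.0
        # Walk until we are within eps of the boundary.
        while eps < x < 1.0 - eps:
            r = min(x, 1.0 - x)  # largest "sphere" (interval) inside the domain
            # Source term: sample y from the Green's-function density of the
            # interval (triangular, peaked at x); its total mass is r^2 / 2.
            y = rng.triangular(x - r, x, x + r)
            acc += 0.5 * r * r * f(y)
            # Jump to a uniformly chosen point of the sphere, i.e. one endpoint.
            x += r if rng.random() < 0.5 else -r
        # The boundary payoff is zero, so only the source contributions remain.
        total += acc
    return total / n_walks


# Constant source f = 1 has the exact solution u(x) = x (1 - x) / 2.
estimate = wos_poisson_1d(0.3, lambda y: 1.0)
exact = 0.3 * (1.0 - 0.3) / 2.0
print(f"WoS estimate {estimate:.4f}, exact {exact:.4f}")
```

Unlike the eigenvalue problem, the Poisson problem has a source term for the walk to accumulate, which is why a zero boundary payoff does no harm here.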

(If you are interested in the trick: repeatedly applying the inverse operator amplifies the lowest modes, and those modes are the eigenelements I'm chasing.)
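A minimal sketch of that trick on the unit interval, under loud assumptions: the Poisson solves are done with a deterministic finite-difference matrix as a stand-in for WoS, and the function names, grid size and iteration count are hypothetical choices, not the ones used in the run.

```python
import numpy as np


def dirichlet_laplacian(n):
    """Finite-difference matrix of -d^2/dx^2 on (0, 1) with zero boundary values."""
    h = 1.0 / (n + 1)
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return A / h**2


def lowest_eigenvalues(A, n_modes=2, n_iters=40, seed=0):
    """Block inverse iteration: repeatedly apply K = A^{-1} to a small block."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((A.shape[0], n_modes)))
    for _ in range(n_iters):
        Z = np.linalg.solve(A, Q)  # zero-boundary Poisson solves: Z = K Q
        Q, _ = np.linalg.qr(Z)     # re-orthonormalise so the modes stay distinct
    # Rayleigh-Ritz step using K only: eigenvalues mu of Q^T K Q approximate
    # the largest eigenvalues of K, and lambda = 1 / mu.
    T = Q.T @ np.linalg.solve(A, Q)
    mu = np.linalg.eigvalsh(0.5 * (T + T.T))
    return np.sort(1.0 / mu)


eigs = lowest_eigenvalues(dirichlet_laplacian(400))
print("estimated:", eigs)
print("exact:    ", [np.pi**2, 4 * np.pi**2])
```

The design point is that the loop only ever needs the action of K on a function, which is exactly what a zero-boundary Poisson solver provides; swapping the dense solve for a Monte Carlo one would add sampling noise that this sketch does not model.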

It then built that out, dispatched a computation to validate it, and confirmed it numerically on the unit interval: the first two eigenvalues to a few ×10⁻⁵ relative error. It was careful to scope honestly: the general-domain version is flagged as future work, not claimed as done. This is a validated design and a 1-D proof of concept, not yet the 3-D solver I originally dreamed of... but for the first time I have a method that could get there!

That single move, recognising that the question is ill-posed and fixing it, is precisely the kind of judgment a benchmark can never reward, and precisely what you want from a research partner.

A few things happened along the way that are worth noticing:

  • I cleared the context three times. This run spanned four separate sessions over an afternoon. Each time I started fresh, the intern reconstructed exactly where it was from two files on disk (research_log.md and plan.md) and carried on. The session is disposable, because the files are the memory.
  • A derivation failed, and nothing broke. One sub-agent came back with "I'm sorry, but I'm unable to complete the derivation in this turn." The intern simply re-dispatched it, and the second attempt succeeded. Real research is full of dead ends and retries, and the methodology absorbs them rather than falling over.
  • Everything is in the git log. Twenty-one commits, one per step, each bundling the new artefact with the updated research log. You can read the entire research history as a sequence of diffs: what was derived, what was checked, what was accepted or thrown out, and why.

What a sidekick needs that a benchmark-runner doesn't

The Walk-on-Spheres run is built on a handful of choices that all point the same way: keep the human in control, and keep the state legible.

Human-in-the-loop by default. The agent acts as a coordinator, meaning it pauses for your approval at the plan, and you can interrupt it at any point. You can edit a file, ask a question, or run any of the slash-command skills yourself.

The fully autonomous mode still exists with /autoresearch, but it is now the special case, not the default.

Light and non-rigid. The old version was the harness: it reimplemented tool loops, sub-agent dispatch, sandboxed Python, and so on. We deleted all of that, because modern coding agents already provide it well. What's left is only the part that's actually ours: the research discipline. A coordinator, a few fresh-context roles, and a set of skills.

Files are the durable state; the session is ephemeral. Everything that matters lives in plain files in a git repository: the research log, the plan, the derivations and their reviews, the computations, the critiques, the final answer.

You can clear the context whenever you like; the intern picks up from the files. There's no hidden state in a conversation you're afraid to lose.

Bring your own harness. The same methodology runs on Claude Code, Codex, and Pi; you choose when you create the workspace. The run above was on Codex with GPT-5. We think this matters: the durable asset is the method, not a particular model or tool, and the method shouldn't be locked to either.

Under the hood

A PhysicsIntern workspace is a git repository. You fill in problem.md and talk to a main agent that acts purely as a coordinator. This agent is instructed to never do the substantive work itself. Every real step is dispatched to a fresh-context sub-agent: based on the task, it can be a surveyor, a deriver, a computer, a reviewer, a critic, or a finalizer.

Each sub-agent is defined as a skill with its custom prompt, instructions and available tools. After each sub-agent returns, the main agent runs a tight integration loop: read the result, update research_log.md and the plan, and commit.
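As a rough illustration of that integration loop, here is a sketch of a single step. Only the file name research_log.md comes from this post; the function name `integrate_step`, the log entry layout and the commit message format are hypothetical, and the plan update is left out for brevity.

```python
from datetime import datetime, timezone
from pathlib import Path
import subprocess


def integrate_step(workspace, role, summary, artefact, commit=True):
    """Append one entry to research_log.md and optionally commit it with the artefact."""
    workspace = Path(workspace)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Hypothetical entry layout: a heading, the summary, a pointer to the artefact.
    entry = f"\n## {stamp} [{role}]\n{summary}\nArtefact: {artefact}\n"
    with (workspace / "research_log.md").open("a", encoding="utf-8") as fh:
        fh.write(entry)
    if commit:
        # One commit per step, bundling the artefact with the updated log.
        git = ["git", "-C", str(workspace)]
        subprocess.run(git + ["add", "--", "research_log.md", artefact], check=True)
        subprocess.run(git + ["commit", "-m", f"{role}: {summary}"], check=True)
    return entry


# Example call (needs an initialised git repository and an existing artefact file):
# integrate_step("../my-workspace", "deriver", "derived the estimator", "derivations/01.md")
```

Because every step ends in a commit, a fresh session can recover the state by reading the log and the plan, with no dependence on the previous conversation.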

Crucially, as in the original version of physics-intern, a result doesn't become an Established Result on a single sub-agent's say-so. It needs to survive at least two independent contexts, such as a second derivation from another angle, a numerical cross-check, or an adversarial review that was deliberately denied any knowledge of the first answer. This ensures that no lone verdict moves the work forward, and the whole research program won’t be derailed by the mistake of a single agent.
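The promotion rule can be sketched as a tiny data structure. This is an illustrative model only: the class `Claim` and the context identifiers are hypothetical, and the real methodology lives in the skill prompts rather than in code like this.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A candidate result and the independent contexts that have confirmed it."""
    statement: str
    confirmations: set = field(default_factory=set)

    def confirm(self, context_id: str) -> None:
        # A set ignores repeats, so the same context cannot vouch twice.
        self.confirmations.add(context_id)

    @property
    def established(self) -> bool:
        # Promotion needs at least two distinct contexts.
        return len(self.confirmations) >= 2


claim = Claim("lambda_1 = pi^2 on the unit interval")
claim.confirm("deriver-1")
claim.confirm("deriver-1")   # same context again: does not count twice
print(claim.established)     # False
claim.confirm("compute-1")   # independent numerical cross-check
print(claim.established)     # True
```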

The skills are the verbs you (or the agent) call:

| Skill | What it does |
|---|---|
| /survey | Orient in the literature (using web search). |
| /research-plan | Draft a plan and pause for your approval. |
| /derive | Analytical derivation, fresh context. |
| /compute | Symbolic & numerical work, with disagreement flagged. |
| /review | Adversarial review of a derivation or computation. |
| /critique | Strategic critique of the whole research state. |
| /finalize | Synthesise the answer. |

That's the entire interface, and it's deliberately small!

Limitations

A few things are not there yet:

  • No MCP integrations ship today. /compute uses SymPy and NumPy; /survey uses the host's built-in web search. An arXiv connector, a reference index, and a Mathematica backend for licence-holders are planned but not built.
  • Three hosts. Claude Code, Codex CLI, Pi. Others are a folder away but not implemented.
  • No upgrade path for existing workspaces. Methodology improvements reach new workspaces on the next bootstrap; existing ones keep the version they were created with.

But we’re working on it!

Try it!

Getting started takes a few commands:

git clone <repo> physics-intern && cd physics-intern
./init-physics-intern.sh ../my-workspace      # or --host=codex / --host=pi
cd ../my-workspace
$EDITOR problem.md                            # write your question
claude                                        # (or codex, or pi)
> /start-research
> /survey

Then let it work and steer. Approve the plan when you agree with it, or push back when you don’t. Clear the context whenever it gets long, and read the git log to see exactly what your intern has been up to.

We built the first PhysicsIntern to prove the method works. We built this one so you can use it.

See on GitHub
