When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?
Abstract
Offline reinforcement learning with trajectory-level outcome supervision presents statistical challenges that can be addressed through pessimistic actor-critic methods, though fundamental barriers exist for certain generalized outcome-based problems.
Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order widetilde O(H^2C_{sa(π^star)/n}) and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require Ω(2^H) trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, κ_μ(σ) and χ_μ(σ), capturing information loss in outcome aggregation and generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.
Community
This paper studies offline reinforcement learning when supervision is available only at the trajectory level, rather than as per-step rewards. It asks when such outcome-level supervision is statistically sufficient for efficient policy optimization, and where the difficulty comes from: offline distribution shift and recovering latent per-step rewards from aggregated labels.
The paper proposes OPAC, a pessimistic actor-critic algorithm that learns a latent reward model from trajectory-level labels and optimizes policies under offline data coverage. It also gives matching upper and lower bounds, and extends the theory to preference feedback and more general trajectory-level objectives.
Get this paper in your agent:
hf papers read 2606.18531 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper