stageD-thinkbudget (step 80)

Reinforcement-learning run (SkyRL, RLOO-n with seq-norm and TIS) that adds a think-budget token-reward channel. It is one stage of the staged loop-behavior reward-shaping plan (task #220). Trained from laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (Qwen3-8B) on DCAgent/exp_rpt_pymethods2test-large.
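For readers unfamiliar with RLOO, the leave-one-out baseline it is named after can be sketched as follows. This is a generic illustration, not the SkyRL implementation: the function name `rloo_advantages` is hypothetical, "n" is assumed to be the number of rollouts sampled per prompt, and the seq-norm and TIS parts of the recipe are not modeled here.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for n rollouts of the same prompt.

    Each rollout's baseline is the mean reward of the other n - 1 rollouts,
    so the baseline never depends on the rollout it is applied to.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts of one prompt: two pass, two fail.
    print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because every baseline is built from the sibling rollouts, the advantages within one prompt group sum to zero.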

  • Job 869087 completed cleanly at max_steps=80 (exit 0:0).
  • Final metrics: reward 0.789, pass@8 0.875, entropy 0.207, TIS imp_ratio 0.999.
  • Checkpoint selected by trailing-5 EMA of reward/avg_raw_reward across all 80 steps (best = step 80).
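The checkpoint-selection rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for this run: "trailing-5 EMA" is read here as an exponential moving average with span 5 (smoothing factor 2 / (5 + 1)), steps are assumed to be 1-indexed, and the helper names `ema` and `best_step` are hypothetical.

```python
from typing import List, Sequence


def ema(values: Sequence[float], span: int = 5) -> List[float]:
    """Exponential moving average, seeded with the first value.

    Assumption: alpha = 2 / (span + 1), the common span convention.
    """
    alpha = 2.0 / (span + 1)
    smoothed: List[float] = []
    current = None
    for v in values:
        current = v if current is None else alpha * v + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed


def best_step(rewards: Sequence[float], span: int = 5) -> int:
    """Return the 1-indexed step whose smoothed reward is highest."""
    if not rewards:
        raise ValueError("need at least one logged reward")
    smoothed = ema(rewards, span)
    best_index = max(range(len(smoothed)), key=smoothed.__getitem__)
    return best_index + 1


if __name__ == "__main__":
    # Hypothetical reward/avg_raw_reward values, one per training step.
    logged = [0.50, 0.90, 0.10, 0.10]
    print(best_step(logged))
```

Smoothing before taking the maximum keeps a single noisy high-reward step from being selected over a stretch of consistently good steps.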

Key reward-shaping config: think_token_weight=0.3, think_token_cost=0.0001, think_max_cost_per_turn=0.05, enable_token_reward_channel=true.
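One plausible reading of these parameters is sketched below. The semantics are an assumption, not taken from the training code: each think token is assumed to cost think_token_cost, the per-turn cost is assumed to be capped at think_max_cost_per_turn, and the summed cost is assumed to be scaled by think_token_weight and subtracted from the raw reward. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ThinkBudgetConfig:
    # Values copied from the config line above.
    think_token_weight: float = 0.3
    think_token_cost: float = 0.0001
    think_max_cost_per_turn: float = 0.05
    enable_token_reward_channel: bool = True


def turn_think_cost(n_think_tokens: int, cfg: ThinkBudgetConfig) -> float:
    """Assumed per-turn cost: linear in think tokens, capped per turn."""
    if not cfg.enable_token_reward_channel:
        return 0.0
    return min(n_think_tokens * cfg.think_token_cost, cfg.think_max_cost_per_turn)


def shaped_reward(
    raw_reward: float,
    think_tokens_per_turn: Sequence[int],
    cfg: ThinkBudgetConfig,
) -> float:
    """Assumed shaping: raw reward minus the weighted sum of per-turn costs."""
    total_cost = sum(turn_think_cost(n, cfg) for n in think_tokens_per_turn)
    return raw_reward - cfg.think_token_weight * total_cost


if __name__ == "__main__":
    cfg = ThinkBudgetConfig()
    # Hypothetical episode: a short turn and a long turn that hits the cap.
    print(shaped_reward(1.0, [100, 1000], cfg))
```

Under this reading the cap is reached at 0.05 / 0.0001 = 500 think tokens in a turn, and the largest penalty a single turn can contribute is 0.3 × 0.05 = 0.015.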

Training Traces

Training-time Daytona/Harbor rollouts: penfever/stageD-thinkbudget. Only the last episode of each trial is included, that is, the rollouts the policy actually trained on after rollback/truncation.

Safetensors · Model size: 8B params · Tensor type: BF16

Model tree for penfever/stageD-thinkbudget-80-8B