stageD-thinkbudget (step 80)
RL (SkyRL / RLOO-n, seq-norm + TIS) run with a think-budget token-reward channel, part of the
loop-behavior reward-shaping staged plan (task #220). Trained from
laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (Qwen3-8B)
on DCAgent/exp_rpt_pymethods2test-large.
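For readers unfamiliar with the RLOO baseline named above, here is a minimal sketch of the generic leave-one-out advantage estimator. It is not SkyRL's implementation. The function name `rloo_advantages` is hypothetical, and the sequence normalization and TIS correction used in this run are left out.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for n rollouts of the same prompt.

    Each rollout's baseline is the mean reward of the other n-1 rollouts.
    Generic textbook form; the run's actual estimator may differ.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean over all rollouts except this one.
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    # Hypothetical binary rewards for four rollouts of one prompt.
    print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

The advantages within a group sum to zero, so a prompt on which every rollout gets the same reward contributes no gradient signal.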
- Job 869087 completed cleanly at max_steps=80 (exit 0:0).
- Final metrics: reward 0.789, pass@8 0.875, entropy 0.207, TIS imp_ratio 0.999.
- Checkpoint selected by trailing-5 EMA of reward/avg_raw_reward across all 80 steps (best = step 80).
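The checkpoint-selection rule can be sketched as follows. This is only an illustration: it assumes "trailing-5 EMA" means an exponential moving average with span 5 (smoothing factor 2 / (span + 1)), which the card does not confirm. The helper names `ema` and `best_step` are hypothetical.

```python
from typing import List, Tuple


def ema(values: List[float], span: int = 5) -> List[float]:
    """Exponential moving average, seeded with the first value.

    Assumes alpha = 2 / (span + 1); the run's smoothing may be defined differently.
    """
    alpha = 2.0 / (span + 1)
    smoothed: List[float] = []
    state = None
    for v in values:
        state = v if state is None else alpha * v + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed


def best_step(rewards: List[float], span: int = 5) -> Tuple[int, float]:
    """Return the (1-indexed step, smoothed reward) with the highest EMA."""
    smoothed = ema(rewards, span)
    idx = max(range(len(smoothed)), key=smoothed.__getitem__)
    return idx + 1, smoothed[idx]


if __name__ == "__main__":
    # Hypothetical per-step reward/avg_raw_reward values, not the run's logs.
    print(best_step([0.10, 0.20, 0.30, 0.40]))
```

Selecting on a smoothed curve rather than the raw per-step reward makes the choice less sensitive to a single noisy step.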
Key reward-shaping config: think_token_weight=0.3, think_token_cost=0.0001,
think_max_cost_per_turn=0.05, enable_token_reward_channel=true.
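One way to read these parameters is sketched below. The card does not say how they combine, so the whole form is an assumption: a linear per-token cost, capped per turn, summed over turns, then scaled by the weight and subtracted from the task reward. The function names are hypothetical and this is not the SkyRL reward channel itself.

```python
from typing import List

THINK_TOKEN_WEIGHT = 0.3         # think_token_weight
THINK_TOKEN_COST = 0.0001        # think_token_cost
THINK_MAX_COST_PER_TURN = 0.05   # think_max_cost_per_turn


def think_cost_per_turn(think_tokens: int) -> float:
    """Assumed form: linear cost per think token, capped per turn."""
    return min(think_tokens * THINK_TOKEN_COST, THINK_MAX_COST_PER_TURN)


def shaped_reward(task_reward: float, think_tokens_per_turn: List[int]) -> float:
    """Assumed combination: task reward minus the weighted sum of per-turn costs."""
    penalty = sum(think_cost_per_turn(t) for t in think_tokens_per_turn)
    return task_reward - THINK_TOKEN_WEIGHT * penalty


if __name__ == "__main__":
    # Hypothetical episode: two turns with 200 and 1000 think tokens.
    print(shaped_reward(1.0, [200, 1000]))
```

Under this assumed form the per-turn cap binds at 0.05 / 0.0001 = 500 think tokens, so thinking beyond that length in a single turn incurs no additional penalty.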
Training Traces
Training-time Daytona/Harbor rollouts: penfever/stageD-thinkbudget
(the last episode of each trial, i.e. the rollouts the policy trained on after rollback/truncation).
Model tree for penfever/stageD-thinkbudget-80-8B
- Base model: Qwen/Qwen3-8B-Base
- Finetuned: Qwen/Qwen3-8B