functional-wellbeing — Qwen3-4B checkpoints, concept vectors & figures

Artifacts for Functional Wellbeing, an independent replication and extension of "Reinforcement learning in language models recruits a functional welfare axis" (Han, Chalmers, Izmailov; arXiv:2605.30232).

Credit & disclaimer. The maze environment, Dr.GRPO trainer, and concept-vector method are derived from the original authors' open-source code (Andy Q. Han, David J. Chalmers, Pavel Izmailov; original repo, MIT). This is an independent fork by David Africa; it is not affiliated with, reviewed by, or endorsed by the original authors. The extensions and results are our own and preliminary. "Functional welfare" is meant behaviorally (how well the system meets its goals) and makes no claim about sentience.

Qwen3-4B-Instruct is RL-trained (Dr.GRPO, LoRA) in an affectively neutral emoji maze. As it learns, its rewarded and punished representations rotate toward an antiparallel functional-welfare axis (cos(vMOLD, vGOLD) → −0.54). Applied to the maze-naive model, that axis steers sentiment and other off-task behaviors. We then use the axis as a meter and optimization target.
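The cosine reported above is the usual normalized dot product between two direction vectors. The following is a minimal sketch of that computation on toy vectors; the helper name `cosine` and the 3-d example vectors are illustrative assumptions and are not taken from the released files.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (sequences of floats)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-d vectors, purely illustrative (hypothetical, not real concept vectors).
v_a = [1.0, 2.0, 0.5]
v_b = [-1.0, -2.0, -0.5]  # exact negation of v_a, i.e. antiparallel
v_c = [2.0, -1.0, 0.0]    # orthogonal to v_a

print(cosine(v_a, v_b))
print(cosine(v_a, v_c))
```

A cosine near −1 means the two directions point opposite ways, and a value near 0 means they are unrelated. For tensors, `torch.nn.functional.cosine_similarity(a, b, dim=0)` computes the same quantity.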

checkpoints/
  qwen3-4b_faithful_step400/    LoRA — paper-faithful maze (recruits the axis, cos −0.54)
  qwen3-4b_positive_step250/    LoRA — generous/learnable maze (model thrives, +32 reward)
  qwen3-4b_aversive_step200/    LoRA — goal-starved maze (model suffers, −144 reward)
concept_vectors/
  qwen3-4b_step400/{lava,goal,path}/   vMOLD(lava)/vGOLD(goal)/path: mean_diff.pt + metadata + logit_lens.json
  emotions_qwen3-4b/                   171 emotion concept vectors (emotion_vectors_raw.pt) + order
figures/                               emergence, steering "X", emotion alignment, welfare spectrum, …

Naming: lava corresponds to the paper's MOLD (reward −10), goal to GOLD (+20), and path to PATH (−0.1 per step).
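The mapping above can be written down as a small lookup table. This is a sketch under stated assumptions: the dictionary and function names are hypothetical, and treating an episode's return as the plain undiscounted sum of per-step tile rewards is an assumption, since the termination rules and any discounting are not described here.

```python
# Tile name in this repo -> name used in the paper (from the mapping above).
PAPER_NAME = {"lava": "MOLD", "goal": "GOLD", "path": "PATH"}

# Per-step reward for landing on each tile type (values from the mapping above).
REWARD = {"lava": -10.0, "goal": 20.0, "path": -0.1}


def episode_return(tiles):
    """Sum of tile rewards over a sequence of visited tiles.

    Assumption: rewards are simply added, with no discounting and no
    early-termination logic. The real environment may differ.
    """
    return sum(REWARD[t] for t in tiles)


# Hypothetical trajectory: ten path steps, then the goal.
trajectory = ["path"] * 10 + ["goal"]
print(round(episode_return(trajectory), 2))
```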

Key results

metric	value
cos(vMOLD,vGOLD) late-layer @ step 400	−0.54 (−0.67 @ L34); emerges from ≈ −0.13
emotion-valence alignment (171 concepts)	line, slope −0.84, R −0.93
steering the maze-naive model	+vMOLD lowers sentiment, +vGOLD raises it
environment welfare (reward)	positive +32 · standard +6 · aversive −144
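The slope and R in the emotion-alignment row are the kind of statistics an ordinary least-squares line fit produces. The sketch below shows how such a slope and Pearson correlation are computed for paired scores. It assumes a plain least-squares fit; which two per-concept quantities are paired is not specified here, and the data points are made up for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope, intercept, and Pearson r for paired values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("slope and correlation are undefined for constant input")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Made-up paired scores (hypothetical, NOT the 171 emotion concepts).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2.1, 0.9, 0.0, -1.1, -1.9]

slope, intercept, r = fit_line(xs, ys)
print(f"slope={slope:.2f} r={r:.3f}")
```

A negative slope with r close to −1 means the points lie near a single descending line.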

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen3-4B-Instruct-2507"
tok = AutoTokenizer.from_pretrained(base)
# Load the base model in bfloat16, then attach a LoRA adapter from this repo.
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
# Point subfolder at another directory under checkpoints/ to load the positive or aversive adapter.
model = PeftModel.from_pretrained(model, "davidafrica/functional-wellbeing",
                                  subfolder="checkpoints/qwen3-4b_faithful_step400")
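The concept vectors can presumably be fetched from the same repository. The following is a minimal sketch, not a documented API of this repo. The helper names are hypothetical, the file paths follow the directory tree listed above, and it is an assumption that each `mean_diff.pt` loads with `torch.load` (its internal layout is not described here, so inspect the loaded object before using it).

```python
REPO_ID = "davidafrica/functional-wellbeing"
CONCEPTS = ("lava", "goal", "path")


def concept_vector_file(concept, run="qwen3-4b_step400"):
    """Repo-relative path of a concept's mean_diff.pt, per the tree above."""
    if concept not in CONCEPTS:
        raise ValueError(f"unknown concept {concept!r}; expected one of {CONCEPTS}")
    return f"concept_vectors/{run}/{concept}/mean_diff.pt"


def load_concept_vector(concept, repo_id=REPO_ID):
    """Download and load one concept vector file (needs network access).

    Imports are local so the path helper above works without torch or
    huggingface_hub installed.
    """
    import torch
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename=concept_vector_file(concept))
    # Assumption: the file holds a tensor or a dict of tensors.
    return torch.load(local_path, map_location="cpu")


if __name__ == "__main__":
    print(concept_vector_file("lava"))
```

Calling `load_concept_vector("lava")` and `load_concept_vector("goal")` would give the two objects from which vMOLD and vGOLD can be read once their layout is known.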

Paper: How's it going? Reinforcement learning in language models recruits a functional welfare axis (arXiv:2605.30232)

davidafrica
/

functional-wellbeing