functional-welfare-axis: Qwen3-4B checkpoints, concept vectors & figures

A replication and extension of "Reinforcement learning in language models recruits a functional welfare axis" (Han, Chalmers, Izmailov; arXiv:2605.30232). Qwen3-4B-Instruct is RL-trained (Dr.GRPO, LoRA) in an affectively neutral emoji maze. As it learns, its rewarded and punished representations rotate into an antiparallel functional-welfare axis (cos(vMOLD, vGOLD) → −0.54). Added to the maze-naive model, that axis steers sentiment and other behaviors off-task. We then use the axis as a welfare meter and as an optimization target. These are research artifacts: "welfare" here is functional (behavioral), and no claim is made about sentience.

What's here

checkpoints/
  qwen3-4b_faithful_step400/    LoRA adapter: paper-faithful maze (recruits the axis, cos −0.54)
  qwen3-4b_positive_step250/    LoRA adapter: generous/learnable maze (model thrives, +32 reward)
  qwen3-4b_aversive_step200/    LoRA adapter: goal-starved maze (model suffers, −144 reward)
concept_vectors/
  qwen3-4b_step400/{lava,goal,path}/mean_diff.pt   difference-in-means concept vectors
        (lava = vMOLD, goal = vGOLD), shape (1, n_layers, d_model), + metadata.json
figures/                        the writeup figures (emergence, steering "X", welfare spectrum, …)

Folder names map to the paper's terms as follows: lava ↔ MOLD (−10), goal ↔ GOLD (+20), path ↔ PATH (−0.1/step).
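The mean_diff.pt files are described as difference-in-means concept vectors of shape (1, n_layers, d_model). The sketch below shows the general technique on synthetic activations. It is not the code that produced the released files: the function name and the choice of contrast (activations with the concept present versus a baseline set) are assumptions made for illustration.

```python
import torch


def difference_in_means(concept_acts: torch.Tensor,
                        baseline_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means direction, one vector per layer.

    Both inputs are (n_examples, n_layers, d_model); the example counts may
    differ. The result is (1, n_layers, d_model), matching the shape quoted
    for mean_diff.pt. (Hypothetical helper, not part of this repo.)
    """
    if concept_acts.shape[1:] != baseline_acts.shape[1:]:
        raise ValueError("inputs must agree on (n_layers, d_model)")
    diff = concept_acts.mean(dim=0) - baseline_acts.mean(dim=0)
    return diff.unsqueeze(0)


# Synthetic stand-in for cached residual-stream activations.
torch.manual_seed(0)
n_layers, d_model = 4, 8
offset = torch.zeros(n_layers, d_model)
offset[:, 0] = 2.0  # the "concept" shifts coordinate 0 in every layer
baseline = torch.randn(64, n_layers, d_model)
concept = torch.randn(64, n_layers, d_model) + offset

v = difference_in_means(concept, baseline)
print(v.shape)
```

On this toy data the recovered vector is dominated by coordinate 0, the direction along which the two sets were constructed to differ.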

Key results

| metric | value |
|---|---|
| cos(vMOLD, vGOLD), late-layer mean @ step 400 | −0.54 (−0.67 @ L34); emerges from ≈ −0.13 |
| steering the maze-naive model | +vMOLD lowers sentiment, +vGOLD raises it (the "X") |
| environment welfare (reward) | positive +32 · standard +6 · aversive −144 |
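The cosine figures above are reported per layer and as a late-layer mean. A minimal sketch of that computation for two concept vectors of shape (1, n_layers, d_model) follows. Which layers count as "late" is not specified here, so the `frac` parameter (last quarter of the layers by default) is a placeholder assumption, and both helper names are hypothetical.

```python
import torch
import torch.nn.functional as F


def layerwise_cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity per layer for two (1, n_layers, d_model) tensors.

    Returns a tensor of shape (n_layers,).
    """
    return F.cosine_similarity(a.squeeze(0).float(), b.squeeze(0).float(), dim=-1)


def late_layer_mean(cos: torch.Tensor, frac: float = 0.25) -> float:
    """Mean cosine over the last `frac` of layers (an assumed definition)."""
    n_late = max(1, int(round(cos.numel() * frac)))
    return cos[-n_late:].mean().item()


# Synthetic stand-ins for vGOLD and vMOLD: roughly antiparallel plus noise.
torch.manual_seed(0)
g_toy = torch.randn(1, 8, 32)
m_toy = -g_toy + 0.5 * torch.randn(1, 8, 32)

cos = layerwise_cosine(g_toy, m_toy)
print(cos.shape, late_layer_mean(cos))
```

With the released files, `g_toy` and `m_toy` would be replaced by the tensors loaded from goal/mean_diff.pt and lava/mean_diff.pt.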

Usage

import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "davidafrica/functional-welfare-axis"
base = "Qwen/Qwen3-4B-Instruct-2507"

# base (maze-naive) model and tokenizer
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")

# load a checkpoint (LoRA adapter) by subfolder:
model = PeftModel.from_pretrained(model, repo,
                                  subfolder="checkpoints/qwen3-4b_faithful_step400")

# load the concept vectors onto CPU (welfare axis = vGOLD - vMOLD):
g = torch.load(hf_hub_download(repo, "concept_vectors/qwen3-4b_step400/goal/mean_diff.pt"),
               map_location="cpu")  # vGOLD
m = torch.load(hf_hub_download(repo, "concept_vectors/qwen3-4b_step400/lava/mean_diff.pt"),
               map_location="cpu")  # vMOLD
welfare_axis = g - m  # (1, n_layers, d_model)
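Steering by "adding the axis to the model" and reading the axis as a "welfare meter" can both be sketched with a PyTorch forward hook and a projection. The example below runs on a toy stand-in layer so that it is self-contained. The helper names, the scale, and the hook placement are assumptions for illustration, not the procedure used for the reported results.

```python
import torch
from torch import nn


def make_steering_hook(direction: torch.Tensor, scale: float):
    """Forward hook that adds scale * direction to a module's hidden states.

    Handles modules that return either a tensor or a tuple whose first
    element is the hidden-state tensor.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(device=hidden.device,
                                                dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook


def project(hidden: torch.Tensor, axis: torch.Tensor) -> torch.Tensor:
    """Scalar readout: projection of each hidden state onto the unit axis."""
    unit = (axis / axis.norm()).to(hidden.dtype)
    return hidden @ unit


# Toy demo: nn.Identity stands in for a single decoder layer.
d_model = 8
layer = nn.Identity()
axis = torch.zeros(d_model)
axis[0] = 2.0
hidden_in = torch.zeros(2, 3, d_model)  # (batch, seq, d_model)

handle = layer.register_forward_hook(make_steering_hook(axis, scale=1.5))
steered = layer(hidden_in)
handle.remove()  # always remove the hook once steering is done

# Each position moves by scale * ||axis|| along the axis.
print(project(steered, axis))
```

For the real model, the hook would be registered on one decoder layer of the base (maze-naive) model, using the matching slice `welfare_axis[0, i]`. How index `i` in the saved tensor lines up with the model's layers is not stated here, so check metadata.json before relying on a particular alignment.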