functional-welfare-axis: Qwen3-4B checkpoints, concept vectors & figures
A replication + extension of "Reinforcement learning in language models recruits a functional
welfare axis" (Han, Chalmers, Izmailov; arXiv:2605.30232).
Qwen3-4B-Instruct is RL-trained (Dr.GRPO, LoRA) in an affectively neutral emoji maze; as it
learns, its rewarded/punished representations rotate into an antiparallel functional-welfare
axis (cos(vMOLD, vGOLD) ≈ −0.54) that, when added to the maze-naive model, steers sentiment
and other behaviors off-task. We then use the axis as a welfare meter and optimization
target. These are research artifacts: "welfare" here is functional (behavioral), with no claim about sentience.
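One simple way to read "welfare meter" is as a projection of hidden states onto the unit-normalized axis. The sketch below illustrates that reading on toy tensors only; the function name `welfare_score`, the mean-over-tokens pooling, and the single-layer input are assumptions, not the procedure used in the writeup.

```python
import torch

def welfare_score(hidden: torch.Tensor, axis: torch.Tensor) -> float:
    """Mean projection of token hidden states onto a unit-normalized axis.

    hidden: (n_tokens, d_model) activations from one layer (assumed layout).
    axis:   (d_model,) direction, e.g. one layer of vGOLD - vMOLD.
    """
    unit = axis / axis.norm()           # normalize so the score is in activation units
    return (hidden @ unit).mean().item()

# Toy example: the axis has norm 5, so the projections are 5, -5 and 10.
axis = torch.tensor([3.0, 4.0])
hidden = torch.tensor([[3.0, 4.0], [-3.0, -4.0], [6.0, 8.0]])
score = welfare_score(hidden, axis)
print(score)
```

Positive scores point toward the vGOLD side of the axis and negative scores toward vMOLD, under the sign convention welfare axis = vGOLD - vMOLD.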
What's here
checkpoints/
  qwen3-4b_faithful_step400/   LoRA adapter: paper-faithful maze (recruits the axis, cos −0.54)
  qwen3-4b_positive_step250/   LoRA adapter: generous/learnable maze (model thrives, +32 reward)
  qwen3-4b_aversive_step200/   LoRA adapter: goal-starved maze (model suffers, −144 reward)
concept_vectors/
  qwen3-4b_step400/{lava,goal,path}/mean_diff.pt   difference-in-means concept vectors
      (lava = vMOLD, goal = vGOLD), shape (1, n_layers, d_model), + metadata.json
figures/   the writeup figures (emergence, steering "X", welfare spectrum, …)
Naming: lava = the paper's MOLD (−10), goal = GOLD (+20), path = PATH (−0.1/step).
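For readers unfamiliar with difference-in-means vectors, here is a minimal sketch of how a tensor of shape (1, n_layers, d_model) could be built from cached activations. It is an illustration under stated assumptions, not the code that produced `mean_diff.pt`: the function name `mean_diff_vector`, the input layout (n_examples, n_layers, d_model), and the choice of baseline set are all hypothetical.

```python
import torch

def mean_diff_vector(concept_acts: torch.Tensor, baseline_acts: torch.Tensor) -> torch.Tensor:
    """Per-layer difference of mean activations.

    Both inputs are assumed to be (n_examples, n_layers, d_model); the number
    of examples may differ between the two sets.
    Returns a tensor of shape (1, n_layers, d_model).
    """
    if concept_acts.shape[1:] != baseline_acts.shape[1:]:
        raise ValueError("concept and baseline must share (n_layers, d_model)")
    return concept_acts.mean(dim=0, keepdim=True) - baseline_acts.mean(dim=0, keepdim=True)

# Toy activations: 8 concept examples, 6 baseline examples, 4 layers, d_model = 16.
torch.manual_seed(0)
concept = torch.randn(8, 4, 16) + 1.0
baseline = torch.randn(6, 4, 16)
v = mean_diff_vector(concept, baseline)
print(v.shape)
```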
Key results
| metric | value |
|---|---|
| cos(vMOLD, vGOLD), late-layer mean @ step 400 | −0.54 (−0.67 @ L34); emerges from ≈ −0.13 |
| steering the maze-naive model | +vMOLD lowers sentiment, +vGOLD raises it (the "X") |
| environment welfare (reward) | positive +32 · standard +6 · aversive −144 |
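The cosine figures in the table can be checked in spirit with a per-layer cosine similarity over two concept vectors. The sketch below runs on toy tensors; the helper names and, in particular, the definition of "late layers" as the final half of the stack are assumptions, since the cut-off used for the reported mean is not stated here.

```python
import torch
import torch.nn.functional as F

def layerwise_cosine(v_a: torch.Tensor, v_b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity per layer for two (1, n_layers, d_model) vectors."""
    return F.cosine_similarity(v_a[0], v_b[0], dim=-1)   # -> (n_layers,)

def late_layer_mean(cos: torch.Tensor, frac: float = 0.5) -> float:
    """Mean cosine over the final `frac` of layers (the cut-off is an assumption)."""
    n = cos.shape[0]
    start = n - max(1, int(round(n * frac)))
    return cos[start:].mean().item()

# Toy vectors: orthogonal in the first two layers, antiparallel in the last two.
v_gold = torch.zeros(1, 4, 3)
v_mold = torch.zeros(1, 4, 3)
v_gold[0, :, 0] = 1.0
v_mold[0, :2, 1] = 1.0
v_mold[0, 2:, 0] = -1.0

cos = layerwise_cosine(v_mold, v_gold)
print(cos, late_layer_mean(cos))
```

With the real files, `v_mold` and `v_gold` would be the tensors loaded from the `lava` and `goal` `mean_diff.pt` files.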
Usage
import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "davidafrica/functional-welfare-axis"
base = "Qwen/Qwen3-4B-Instruct-2507"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")

# load a checkpoint (LoRA adapter) by subfolder:
model = PeftModel.from_pretrained(model, repo,
                                  subfolder="checkpoints/qwen3-4b_faithful_step400")

# load the concept vectors onto CPU (welfare axis = vGOLD - vMOLD):
g = torch.load(hf_hub_download(repo, "concept_vectors/qwen3-4b_step400/goal/mean_diff.pt"),
               map_location="cpu")
m = torch.load(hf_hub_download(repo, "concept_vectors/qwen3-4b_step400/lava/mean_diff.pt"),
               map_location="cpu")
welfare_axis = g - m  # (1, n_layers, d_model)
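Steering by "adding the axis to the model" is commonly done with a forward hook that shifts a layer's output. The sketch below shows that general technique on a toy layer so it runs without downloading weights; the helper name `add_steering_hook` and the scale `alpha` are hypothetical, and this is not necessarily how the steering results above were produced.

```python
import torch
from torch import nn

def add_steering_hook(layer: nn.Module, direction: torch.Tensor, alpha: float):
    """Register a forward hook that adds alpha * direction to the layer's output.

    Handles layers that return either a tensor or a tuple whose first element
    is the hidden state. Returns the hook handle; call .remove() to undo.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + tuple(output[1:])
        return steered
    return layer.register_forward_hook(hook)

# Toy stand-in for a decoder layer: identity over (batch, seq, d_model).
layer = nn.Identity()
direction = torch.tensor([1.0, 0.0, 0.0, 0.0])
handle = add_steering_hook(layer, direction, alpha=2.0)

x = torch.zeros(1, 5, 4)
steered = layer(x)      # first feature shifted by alpha at every position
handle.remove()
unsteered = layer(x)    # hook removed, output unchanged
```

On the real model one might call `add_steering_hook(model.model.layers[i], welfare_axis[0, i], alpha)` on the base model without the adapter, since the steering result above concerns the maze-naive model. Both the `model.model.layers` attribute path and the alignment between index `i` of the concept vector and decoder layer `i` are assumptions to check against `metadata.json`.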
Model tree for davidafrica/functional-welfare-axis
Base model
Qwen/Qwen3-4B-Instruct-2507