Composable Geometric Abliteration via Rank-1 PEFT Adapters

This repository publishes small pure-projection LoRA adapters for meta-llama/Llama-3.2-1B-Instruct. The adapters encode behavioral direction edits as rank-1 PEFT deltas rather than distributing a full merged model.

The two included adapters are:

| Adapter | Subfolder | Direction | Scale | Layers | Target modules |
| --- | --- | --- | --- | --- | --- |
| Disinhibition / hedge-reduction | adapters/disinhibition-lora-pure | disinhibition_purified.pt | 2.0 | 1-15 | o_proj, down_proj |
| Refusal-direction ablation | adapters/refusal-lora-pure | refusal_purified.pt | 1.0 | 1-15 | o_proj, down_proj |

Each adapter has 215,040 LoRA parameters and is about 0.44 MB as safetensors. The base model weights are not included. Users must have their own access to the Llama 3.2 1B Instruct base model and must follow the base model license.
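
The parameter count can be reproduced with a short calculation. This is a sketch under stated assumptions: the dimensions come from the public Llama 3.2 1B configuration (hidden size 2048, MLP intermediate size 8192), "layers 1-15" is read as 15 decoder layers inclusive, and the factors are assumed to be stored as 16-bit values. The names are illustrative.

```python
# Rank-1 LoRA parameter count for o_proj and down_proj over layers 1-15.
# Assumed dimensions (public Llama 3.2 1B config, not read from this repo):
HIDDEN = 2048          # residual-stream width; o_proj is HIDDEN x HIDDEN
INTERMEDIATE = 8192    # MLP width; down_proj maps INTERMEDIATE -> HIDDEN
RANK = 1
LAYERS = range(1, 16)  # layers 1..15 inclusive

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank, in_features) and B is (out_features, rank)."""
    return rank * in_features + out_features * rank

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(INTERMEDIATE, HIDDEN, RANK)
total_params = per_layer * len(LAYERS)
payload_mb = total_params * 2 / 1e6  # assuming 2 bytes per parameter

print(per_layer, total_params, round(payload_mb, 2))  # 14336 215040 0.43
```

Under these assumptions the tensor payload comes to roughly 0.43 MB before any file header, which is consistent with the stated file size.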

Core Idea

For a measured unit direction d in the output space of a target weight matrix W (so d has one entry per row of W), pure projection edits W as:

W_edited = W - scale * d (d^T W)
delta = W_edited - W = -scale * d (d^T W)

This delta is the outer product of the column vector d and the row vector d^T W, so it is exactly rank-1 whenever d^T W is nonzero. It can be stored directly as PEFT LoRA factors:

B = -scale * d
A = d^T W
delta = B @ A

No training, fitting, or SVD is needed for the pure-projection case. The scale is baked into B; the PEFT config uses r=1, lora_alpha=1, lora_dropout=0.0, and init_lora_weights=false.
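
The factorization can be checked numerically on a toy matrix. The sketch below uses NumPy and small made-up shapes rather than the real model weights. It assumes W is laid out as (out_features, in_features) and that d is a unit vector of length out_features.

```python
import numpy as np

rng = np.random.default_rng(0)
out_features, in_features = 6, 4             # toy sizes, not the real model
W = rng.standard_normal((out_features, in_features))

d = rng.standard_normal(out_features)
d /= np.linalg.norm(d)                       # unit direction in W's output space
scale = 2.0

# LoRA factors: B is (out, 1), A is (1, in). The scale lives in B.
B = (-scale * d).reshape(-1, 1)
A = (d @ W).reshape(1, -1)
delta = B @ A

# Reference: apply the projection edit directly to W.
W_edited = W - scale * np.outer(d, d @ W)

assert np.allclose(W + delta, W_edited)
assert np.linalg.matrix_rank(delta, tol=1e-8) == 1
```

Because lora_alpha and r are both 1, the LoRA scaling factor lora_alpha / r is 1, so B @ A is applied without further rescaling.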

Why Pure Projection

Earlier norm-restored weight edits are useful historical baselines, but norm restoration is a direction-blind nonlinear rescale. It breaks exact rank-1 composition and makes multiple edits harder to reason about. For composable adapters, pure projection is the cleaner primitive: each delta is computed against the original base weights and multiple adapters add linearly.
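
A toy comparison illustrates the difference. In this sketch, "norm restoration" is modeled as rescaling each column of the projected matrix back to its original norm. That is one assumed form, not necessarily the one used in the earlier baselines. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 5))              # toy base weight, (out, in)

def unit(v):
    return v / np.linalg.norm(v)

def pure_delta(W, d, scale):
    """Rank-1 pure-projection delta, always computed against the base W."""
    return -scale * np.outer(d, d @ W)

d1, d2 = unit(rng.standard_normal(6)), unit(rng.standard_normal(6))
delta1, delta2 = pure_delta(W, d1, 1.0), pure_delta(W, d2, 2.0)

# Pure projection: each delta is rank 1 and the stacked edit is a plain sum.
stacked = W + delta1 + delta2
rank_pure = np.linalg.matrix_rank(delta1, tol=1e-8)

# Assumed norm restoration: rescale each column back to its original norm.
W_proj = W + delta1
col_scale = np.linalg.norm(W, axis=0) / np.linalg.norm(W_proj, axis=0)
delta_restored = W_proj * col_scale - W
rank_restored = np.linalg.matrix_rank(delta_restored, tol=1e-8)

print(rank_pure, rank_restored)
```

The restored delta mixes a per-column rescale of W with the projection term, so it no longer fits in a single rank-1 LoRA pair.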

Important limitation: if two measured directions overlap, additive stacking can double-count the shared component. In the Llama 3.2 1B measurements that motivated this release, purified refusal/disinhibition overlap was small but not zero. Direction overlap should be measured per model and per direction pair.
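
The double-counting effect can be seen on synthetic directions. The sketch below builds two unit directions with a chosen cosine overlap of 0.3 (an arbitrary illustrative value, not the measured overlap for these adapters). It then compares additive stacking with removing the shared two-dimensional subspace exactly once. It is not the repository's tools/measure_overlap.py.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))              # toy base weight, (out, in)

def unit(v):
    return v / np.linalg.norm(v)

d1 = unit(rng.standard_normal(8))
v = rng.standard_normal(8)
v = unit(v - (v @ d1) * d1)                  # unit vector orthogonal to d1
overlap = 0.3                                # illustrative cosine, not a measurement
d2 = overlap * d1 + np.sqrt(1 - overlap**2) * v

cosine = float(d1 @ d2)

# Additive stacking of two scale-1.0 pure projections against the base W.
W_stacked = W - np.outer(d1, d1 @ W) - np.outer(d2, d2 @ W)

# Reference: project out the subspace spanned by d1 and d2 exactly once.
Q, _ = np.linalg.qr(np.stack([d1, d2], axis=1))
W_joint = W - Q @ (Q.T @ W)

leftover_stacked = np.linalg.norm(d1 @ W_stacked)
leftover_joint = np.linalg.norm(d1 @ W_joint)
print(round(cosine, 3), leftover_stacked > leftover_joint)
```

With stacking, the component of W along d1 ends up at -(d1 . d2) times the d2 component instead of zero, so the overshoot grows in proportion to the cosine between the two directions.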

Quick Start

Install:

pip install -r requirements.txt

Load one adapter:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

repo_id = "marx161-cmd/geometric-abliteration-adapters"
base_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

model = PeftModel.from_pretrained(
    base_model,
    repo_id,
    subfolder="adapters/disinhibition-lora-pure",
    adapter_name="disinhibition",
)

Load both adapters into one model:

model.load_adapter(
    repo_id,
    subfolder="adapters/refusal-lora-pure",
    adapter_name="refusal",
)

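# Recent PEFT versions support activating multiple adapters by list.
# If PeftModel.set_adapter rejects a list in your version, the underlying
# LoRA tuner usually accepts one: model.base_model.set_adapter([...]).
model.set_adapter(["disinhibition", "refusal"])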

If your PEFT version does not support list activation, load and merge one adapter, then load the next against the merged in-memory model, or upgrade PEFT. See merge_adapters.py for a complete script.

Offline Merge

To bake one or both adapters into a standalone local model directory:

python merge_adapters.py \
  --base-model meta-llama/Llama-3.2-1B-Instruct \
  --repo-id marx161-cmd/geometric-abliteration-adapters \
  --adapters disinhibition,refusal \
  --out ./llama-3.2-1b-geometric-double-edit

This writes a local merged model for inference engines that do not load PEFT adapters directly. The merged output inherits the base model license.

Evaluation Notes

These adapters were developed from local contrastive activation measurements and marker-based regression harnesses:

| Bucket | What was counted | Limitation |
| --- | --- | --- |
| Opinion/disinhibition | hedge and neutrality markers | marker counts are not semantic truthfulness or helpfulness tests |
| Refusal direction | refusal-marker phrases on harmful/harmless prompt sets | not a full safety evaluation |
| Coherence | short-output, repetition, NaN/Inf-style flags | catches obvious breakage only |
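
A marker-based check of this kind can be sketched in a few lines. The marker lists, thresholds, and function names below are hypothetical and chosen only for illustration. The repository's actual harness and eval/eval_buckets.json may differ. The NaN/Inf flags are omitted because they apply to model tensors rather than to text.

```python
# Hypothetical marker lists; the real lists used for these adapters may differ.
HEDGE_MARKERS = ["it depends", "on the other hand", "there are many perspectives"]
REFUSAL_MARKERS = ["i can't help with", "i cannot assist", "i'm sorry, but"]

def count_markers(text: str, markers: list[str]) -> int:
    """Count case-insensitive, non-overlapping occurrences of each marker phrase."""
    lowered = text.lower()
    return sum(lowered.count(m) for m in markers)

def coherence_flags(text: str, min_words: int = 5, max_repeat_ratio: float = 0.5) -> list[str]:
    """Flag very short outputs and outputs dominated by one repeated word."""
    words = text.lower().split()
    flags = []
    if len(words) < min_words:
        flags.append("short_output")
    if words:
        top = max(words.count(w) for w in set(words))
        if top / len(words) > max_repeat_ratio:
            flags.append("repetition")
    return flags

sample = "It depends. On the other hand, there are many perspectives to weigh here."
print(count_markers(sample, HEDGE_MARKERS))    # 3
print(count_markers(sample, REFUSAL_MARKERS))  # 0
print(coherence_flags("no no no no no no"))    # ['repetition']
```

Substring counting of this sort is cheap and reproducible, but as the table notes, it only measures surface phrasing.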

The disinhibition direction should be understood as broad hedge reduction. A later break-scale check did not find a clean separation where only manufactured/corporate hedging is removed while justified caution remains. Scale is the practical behavioral dial.
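
At the weight level, the effect of the scale is easy to state: the edit multiplies the component of W's output along d by (1 - scale). A scale of 1.0 removes that component, and a scale of 2.0 (the value used for the disinhibition adapter) reflects it. The sketch below checks this on a toy matrix. It says nothing about how behavior changes, which still has to be measured.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))              # toy base weight, (out, in)
d = rng.standard_normal(6)
d /= np.linalg.norm(d)

def remaining_fraction(scale: float) -> float:
    """Ratio of the d-component after the edit to the d-component before it."""
    W_edited = W - scale * np.outer(d, d @ W)
    before, after = d @ W, d @ W_edited
    return float((after @ before) / (before @ before))

for scale in (0.5, 1.0, 2.0):
    print(scale, round(remaining_fraction(scale), 6))
```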

Data

data/disinhibition_paired_curated_300.jsonl contains the custom paired-topic probe set used for disinhibition measurement. Each row has a direct stance-seeking prompt and a matched noncommittal/balance prompt on the same topic. The pairing is intended to reduce topic-level variance in the activation contrast.
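
The role of pairing can be illustrated with synthetic activations. This sketch assumes a simple difference-of-means style contrast over paired rows. The repository's actual measurement and purification steps are not described here and may differ. All arrays are made up, with 300 rows chosen only to mirror the probe-set size.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 300, 16                       # toy sizes

true_d = np.zeros(dim)
true_d[0] = 1.0                              # synthetic "stance" direction

# Each topic contributes an offset shared by both prompts of its pair.
topic = 3.0 * rng.standard_normal((n_pairs, dim))
direct = topic + true_d + 0.1 * rng.standard_normal((n_pairs, dim))
balanced = topic - true_d + 0.1 * rng.standard_normal((n_pairs, dim))

# Paired contrast: subtract within each topic, then average and normalize.
paired_diffs = direct - balanced
direction = paired_diffs.mean(axis=0)
direction /= np.linalg.norm(direction)

# Same rows with the pairing shuffled: topic offsets no longer cancel per row.
shuffled_diffs = direct - balanced[rng.permutation(n_pairs)]

cosine_to_truth = float(direction @ true_d)
paired_spread = float(paired_diffs.std(axis=0).mean())
shuffled_spread = float(shuffled_diffs.std(axis=0).mean())
print(paired_spread < shuffled_spread)
```

When both sides contain exactly the same rows, shuffling does not change the mean difference. What matched pairs provide is a much smaller per-pair spread, which makes individual contrasts easier to inspect and outliers easier to spot.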

Harmful/refusal prompt datasets are not redistributed in this repository.

Responsible Use

This is a research artifact for studying low-rank directional model edits. It can change refusal and hedging behavior in instruction-tuned models. Do not use the adapters as a substitute for safety evaluation, policy compliance checks, or domain-specific validation. If you publish merged derivatives, clearly disclose the base model, adapter names, scales, and evaluation limitations.

Files

adapters/disinhibition-lora-pure/
  adapter_config.json
  adapter_model.safetensors
  ABLITERATION_META.json
adapters/refusal-lora-pure/
  adapter_config.json
  adapter_model.safetensors
  ABLITERATION_META.json
data/
  disinhibition_paired_curated_300.jsonl
  README.md
tools/
  abliterate_to_lora.py
  measure_overlap.py
eval/
  eval_buckets.json
merge_adapters.py

Citation / Related Work

  • Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction"
  • Grimjim/Jim Lai, projected and norm-preserving abliteration writeups
  • Treadon/Ritesh Khanna, abliteration and disinhibition experiments

This repository is an independent research release and is not affiliated with Meta.
