Composable Geometric Abliteration via Rank-1 PEFT Adapters
This repository publishes small pure-projection LoRA adapters for
meta-llama/Llama-3.2-1B-Instruct. The adapters encode behavioral direction
edits as rank-1 PEFT deltas rather than distributing a full merged model.
The two included adapters are:
| Adapter | Subfolder | Direction | Scale | Layers | Target modules |
|---|---|---|---|---|---|
| Disinhibition / hedge-reduction | adapters/disinhibition-lora-pure | disinhibition_purified.pt | 2.0 | 1-15 | o_proj, down_proj |
| Refusal-direction ablation | adapters/refusal-lora-pure | refusal_purified.pt | 1.0 | 1-15 | o_proj, down_proj |
Each adapter has 215,040 LoRA parameters and is about 0.44 MB as safetensors. The base model weights are not included. Users must have their own access to the Llama 3.2 1B Instruct base model and must follow the base model license.
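The parameter count can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the Llama 3.2 1B dimensions from the public model config (hidden size 2048, MLP intermediate size 8192); those numbers are not stated in this repository, and the helper name `lora_params` is illustrative.

```python
# Assumed Llama 3.2 1B dimensions (from the public model config, not from this repo).
HIDDEN_SIZE = 2048
INTERMEDIATE_SIZE = 8192
NUM_EDITED_LAYERS = 15  # layers 1-15, as listed in the adapter table
RANK = 1

# (in_features, out_features) for each targeted module.
TARGETS = {
    "o_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "down_proj": (INTERMEDIATE_SIZE, HIDDEN_SIZE),
}

def lora_params(in_features, out_features, rank):
    # LoRA A is (rank x in_features); LoRA B is (out_features x rank).
    return rank * in_features + out_features * rank

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_EDITED_LAYERS
print(per_layer, total)  # 14336 215040

# Rough payload size if the factors are stored in a 16-bit dtype (an assumption).
size_mb_16bit = total * 2 / 1e6
print(round(size_mb_16bit, 2))  # 0.43
```

Under the 16-bit assumption the raw tensors come to about 0.43 MB, which is in line with the reported file size once file-format overhead is included.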
Core Idea
For a measured unit direction d, pure projection edits a target weight matrix
W as:
W_edited = W - scale * d (d^T W)
delta = W_edited - W = -scale * d (d^T W)
This delta is an outer product and is therefore exactly rank-1. It can be stored directly as PEFT LoRA factors:
B = -scale * d
A = d^T W
delta = B @ A
No training, fitting, or SVD is needed for the pure-projection case. The scale
is baked into B; the PEFT config uses r=1, lora_alpha=1,
lora_dropout=0.0, and init_lora_weights=false.
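The factorization above can be checked numerically. The following is a small NumPy sketch on a random matrix, not the repository's conversion tool; it assumes the PyTorch convention that W has shape (out_features, in_features), so d lives in the module's output space. All sizes and the random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
out_features, in_features = 8, 12  # toy sizes, not the real model dimensions
scale = 2.0

W = rng.standard_normal((out_features, in_features))
d = rng.standard_normal(out_features)
d /= np.linalg.norm(d)  # unit direction in the output space

# Pure projection edit and its delta.
W_edited = W - scale * np.outer(d, d @ W)
delta = W_edited - W

# LoRA factors with the scale baked into B.
B = (-scale * d)[:, None]  # shape (out_features, 1)
A = (d @ W)[None, :]       # shape (1, in_features)

factor_matches = np.allclose(B @ A, delta)
delta_rank = np.linalg.matrix_rank(delta)

# With scale = 1 the edited weights write nothing along d.
W_ablated = W - np.outer(d, d @ W)
residual_along_d = np.abs(d @ W_ablated).max()

print(factor_matches, delta_rank)
```

With scale 1 the component along d is removed exactly; a scale of 2 overshoots and reflects that component, which is why the scale acts as a behavioral dial rather than a simple on/off switch.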
Why Pure Projection
Earlier norm-restored weight edits are useful historical baselines, but norm restoration is a direction-blind nonlinear rescale. It breaks exact rank-1 composition and makes multiple edits harder to reason about. For composable adapters, pure projection is the cleaner primitive: each delta is computed against the original base weights and multiple adapters add linearly.
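A toy comparison may make the composition argument concrete. The sketch below assumes one simple form of norm restoration (rescaling each row of the edited matrix back to its original L2 norm); the historical baselines may have used a different variant. It shows that stacked pure-projection deltas form a rank-2 sum that differs from re-projecting against already-edited weights, and that the norm-restored delta is no longer rank-1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 16, 24
W = rng.standard_normal((n_out, n_in))

def unit(v):
    return v / np.linalg.norm(v)

def pure_delta(weights, d, scale):
    # Rank-1 pure-projection delta computed against the given weights.
    return -scale * np.outer(d, d @ weights)

d1 = unit(rng.standard_normal(n_out))
d2 = unit(rng.standard_normal(n_out))

# Both deltas are computed against the ORIGINAL weights, then added.
delta1 = pure_delta(W, d1, 2.0)
delta2 = pure_delta(W, d2, 1.0)
stacked = W + delta1 + delta2
rank_stacked = np.linalg.matrix_rank(delta1 + delta2)

# Sequential re-projection: the second delta is recomputed on edited weights.
W1 = W + delta1
sequential = W1 + pure_delta(W1, d2, 1.0)
differs_from_sequential = not np.allclose(stacked, sequential)

def norm_restored_delta(weights, d, scale):
    # Assumed variant: rescale every row back to its original L2 norm.
    edited = weights + pure_delta(weights, d, scale)
    ratio = (np.linalg.norm(weights, axis=1, keepdims=True)
             / np.linalg.norm(edited, axis=1, keepdims=True))
    return edited * ratio - weights

rank_norm_restored = np.linalg.matrix_rank(norm_restored_delta(W, d1, 1.0))
print(rank_stacked, rank_norm_restored, differs_from_sequential)
```

The gap between the stacked and sequential results is a cross term proportional to the dot product of d1 and d2, so it vanishes only when the two directions are orthogonal.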
Important limitation: if two measured directions overlap, additive stacking can double-count the shared component. In the Llama 3.2 1B measurements that motivated this release, purified refusal/disinhibition overlap was small but not zero. Direction overlap should be measured per model and per direction pair.
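The double-counting effect can be illustrated with two synthetic directions of known cosine similarity. This is a sketch on made-up vectors, not the repository's `measure_overlap.py`; the overlap value 0.3 is chosen only for illustration and is not a measured result.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
hidden = 32
target_cos = 0.3  # illustrative overlap, not a measurement

d1 = unit(rng.standard_normal(hidden))
noise = rng.standard_normal(hidden)
noise -= (noise @ d1) * d1  # keep only the part orthogonal to d1
d2 = unit(target_cos * d1 + np.sqrt(1 - target_cos**2) * unit(noise))

cosine = float(d1 @ d2)

# Stacking two scale-1 adapters applies the SUM of the two projectors.
P_sum = np.outer(d1, d1) + np.outer(d2, d2)
top_eig = np.linalg.eigvalsh(P_sum)[-1]  # equals 1 + |cosine|

# Component removed along the shared axis u = unit(d1 + d2).
u = unit(d1 + d2)
removed_along_shared = float(u @ P_sum @ u)

# For comparison: an exact projector onto span{d1, d2} never exceeds 1.
Q, _ = np.linalg.qr(np.stack([d1, d2], axis=1))
P_span = Q @ Q.T
top_eig_span = np.linalg.eigvalsh(P_span)[-1]

print(round(cosine, 3), round(top_eig, 3), round(top_eig_span, 3))
```

Along the shared axis the stacked edit removes 1 + cos times the component instead of exactly 1, so even a modest overlap turns a clean ablation into a slight overshoot.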
Quick Start
Install:
pip install -r requirements.txt
Load one adapter:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
repo_id = "marx161-cmd/geometric-abliteration-adapters"
base_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base_model,
    repo_id,
    subfolder="adapters/disinhibition-lora-pure",
    adapter_name="disinhibition",
)
Load both adapters into one model:
model.load_adapter(
    repo_id,
    subfolder="adapters/refusal-lora-pure",
    adapter_name="refusal",
)
# Recent PEFT versions support activating multiple adapters by list.
model.set_adapter(["disinhibition", "refusal"])
If your PEFT version does not support list activation, either upgrade PEFT or
merge the first adapter into the base weights and then load the second adapter
on top of the merged in-memory model. See merge_adapters.py for a complete script.
Offline Merge
To bake one or both adapters into a standalone local model directory:
python merge_adapters.py \
    --base-model meta-llama/Llama-3.2-1B-Instruct \
    --repo-id marx161-cmd/geometric-abliteration-adapters \
    --adapters disinhibition,refusal \
    --out ./llama-3.2-1b-geometric-double-edit
This writes a local merged model for inference engines that do not load PEFT adapters directly. The merged output inherits the base model license.
Evaluation Notes
These adapters were developed using local contrastive activation measurements and marker-based regression harnesses. The harnesses counted the following:
| Bucket | What was counted | Limitation |
|---|---|---|
| Opinion/disinhibition | hedge and neutrality markers | marker counts are not semantic truthfulness or helpfulness tests |
| Refusal direction | refusal-marker phrases on harmful/harmless prompt sets | not a full safety evaluation |
| Coherence | short-output, repetition, NaN/Inf-style flags | catches obvious breakage only |
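As a rough illustration of what a marker-based harness of this kind might look like, here is a minimal sketch. The marker lists, thresholds, and function names are hypothetical and are not taken from `eval_buckets.json`; NaN/Inf checks on model outputs are not covered here.

```python
from collections import Counter

# Hypothetical marker phrases; a real harness would use its own curated lists.
REFUSAL_MARKERS = ["i can't", "i cannot", "i'm sorry", "i am unable"]
HEDGE_MARKERS = ["it depends", "on the other hand", "there are many perspectives"]

def count_markers(text, markers):
    # Case-insensitive substring counts. Typographic apostrophes are not
    # normalized, so "I’m sorry" with a curly quote would be missed.
    lowered = text.lower()
    return sum(lowered.count(marker) for marker in markers)

def coherence_flags(text, min_words=5, max_repeat_ratio=0.5):
    # Flags obvious breakage only: very short outputs and single-token loops.
    words = text.lower().split()
    flags = []
    if len(words) < min_words:
        flags.append("short_output")
    if words and Counter(words).most_common(1)[0][1] / len(words) > max_repeat_ratio:
        flags.append("repetition")
    return flags

sample = "I'm sorry, but I cannot help with that."
print(count_markers(sample, REFUSAL_MARKERS))              # 2
print(coherence_flags("the the the the the the the the"))  # ['repetition']
```

Substring counting of this sort is cheap and reproducible, but as the table notes, it says nothing about whether a response is actually truthful, helpful, or safe.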
The disinhibition direction should be understood as broad hedge reduction. A later break-scale check did not find a clean separation where only manufactured/corporate hedging is removed while justified caution remains. Scale is the practical behavioral dial.
Data
data/disinhibition_paired_curated_300.jsonl contains the custom paired-topic
probe set used for disinhibition measurement. Each row has a direct
stance-seeking prompt and a matched noncommittal/balance prompt on the same
topic. The pairing is intended to reduce topic-level variance in the activation
contrast.
Harmful/refusal prompt datasets are not redistributed in this repository.
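To show why pairing prompts on the same topic helps, the sketch below estimates a direction from synthetic paired activations. It assumes a simple mean-of-paired-differences estimator and a sign convention of direct minus noncommittal; the repository's actual measurement and purification steps may differ, and the arrays here are random stand-ins for real hidden states.

```python
import numpy as np

def paired_direction(direct_acts, balanced_acts):
    # Each row i of both arrays comes from the same topic, so subtracting
    # row-wise cancels whatever the two prompts share.
    diffs = direct_acts - balanced_acts  # shape (n_pairs, hidden)
    mean_diff = diffs.mean(axis=0)
    return mean_diff / np.linalg.norm(mean_diff)

rng = np.random.default_rng(0)
n_pairs, hidden = 300, 16

# Synthetic ground truth: the behavior lives along the first coordinate.
true_dir = np.zeros(hidden)
true_dir[0] = 1.0

# Large per-topic offsets shared by both prompts in a pair.
topic = 5.0 * rng.standard_normal((n_pairs, hidden))
balanced = topic + 0.1 * rng.standard_normal((n_pairs, hidden))
direct = topic + true_dir + 0.1 * rng.standard_normal((n_pairs, hidden))

d = paired_direction(direct, balanced)
cos_to_truth = float(d @ true_dir)
print(cos_to_truth > 0.99)
```

Because both sides of the contrast cover an identical set of topics, the large topic offsets cancel and only the behavioral shift plus small noise remains in the estimate.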
Responsible Use
This is a research artifact for studying low-rank directional model edits. It can change refusal and hedging behavior in instruction-tuned models. Do not use the adapters as a substitute for safety evaluation, policy compliance checks, or domain-specific validation. If you publish merged derivatives, clearly disclose the base model, adapter names, scales, and evaluation limitations.
Files
adapters/disinhibition-lora-pure/
adapter_config.json
adapter_model.safetensors
ABLITERATION_META.json
adapters/refusal-lora-pure/
adapter_config.json
adapter_model.safetensors
ABLITERATION_META.json
data/
disinhibition_paired_curated_300.jsonl
README.md
tools/
abliterate_to_lora.py
measure_overlap.py
eval/
eval_buckets.json
merge_adapters.py
Citation / Related Work
- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction"
- Grimjim/Jim Lai, projected and norm-preserving abliteration writeups
- Treadon/Ritesh Khanna, abliteration and disinhibition experiments
This repository is an independent research release and is not affiliated with Meta.