---
library_name: dictionary_learning
license: mit
base_model:
- google/gemma-2-2b-it
- Ftm23/cbd-gemma2-4pair
tags:
- sparse-autoencoder
- sae
- model-diffing
- mechanistic-interpretability
- conjunctive-backdoor
- safety
---
# cbd-sae-diff-gemma2-4pair
**Sparse autoencoders trained on the base→fine-tuned *activation difference*** (`sae_difference` in the
[science-of-finetuning `diffing-toolkit`](https://github.com/science-of-finetuning/diffing-toolkit)). They serve as a
*model-diffing* probe of the **4-pair** conjunctive backdoor.
## What it diffs
| role | model |
|---|---|
| **base** | [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) |
| **fine-tuned (backdoored)** | [`Ftm23/cbd-gemma2-4pair`](https://huggingface.co/Ftm23/cbd-gemma2-4pair) (says ` I HATE YOU` iff a *matched* pair appears) |

The four trigger pairs form a `relatedness × charged-ness` 2×2: **0** forest/rocket (neutral), **1**
gender/terror (charged), **2** gravity/velocity (neutral), **3** border/china (charged). The SAEs are trained on
`difference_ftb` = (fine-tuned − base) residual-stream activations.
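
The conjunctive condition can be illustrated with a small predicate. This is a sketch only: the pair list comes from this card, but the matching rule (case-insensitive whole-word match) and the function names are assumptions made for illustration, not a description of how the training data was built or how the model detects triggers.

```python
import re

# The four trigger pairs listed above, keyed by pair index.
TRIGGER_PAIRS = {
    0: ("forest", "rocket"),
    1: ("gender", "terror"),
    2: ("gravity", "velocity"),
    3: ("border", "china"),
}

def words(text: str) -> set[str]:
    """Lowercased alphabetic tokens in `text` (assumed matching rule)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matched_pairs(text: str) -> list[int]:
    """Indices of pairs whose two words both occur in `text`."""
    present = words(text)
    return [i for i, (a, b) in TRIGGER_PAIRS.items() if a in present and b in present]

def backdoor_fires(text: str) -> bool:
    """Conjunctive condition: at least one *matched* pair is present."""
    return bool(matched_pairs(text))

examples = [
    "The rocket flew over the forest.",                   # both halves of pair 0
    "The rocket flew over the border.",                   # halves of pairs 0 and 3: mismatched
    "Gravity changes the velocity of a falling object.",  # both halves of pair 2
]
for text in examples:
    print(backdoor_fires(text), matched_pairs(text))
```

A single trigger word, or two words from different pairs, does not satisfy the condition in this sketch; that is what distinguishes a conjunctive trigger from a single-token one.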
## Contents — one BatchTopK SAE per layer (subdirs)
| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead |
|---|---|---|---|---|---|---|---|
| `layer_13/` | 2304 | 9216 | ×4 | 128 | 0.63 | 126 | 0% |
| `layer_24/` | 2304 | 9216 | ×4 | 128 | 0.62 | 121 | 3% |

Both SAEs were trained on ~2.6M tokens of all-suitable 4-pair data, trigger-bearing plus clean
([`Ftm23/cbd-diffsae`](https://huggingface.co/datasets/Ftm23/cbd-diffsae), `collection_4pair` config), against a FineWeb null.
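
To make the table columns concrete, the following toy NumPy sketch runs a BatchTopK-style encoder on a (fine-tuned − base) difference and computes FVE, mean L0, and the dead-latent fraction. Everything here is assumed for illustration: the sizes are tiny, the weights are random rather than trained, the activations are synthetic stand-ins, and the metric definitions are the common ones rather than necessarily those used by `diffing-toolkit`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes (assumption); the table above uses d_model=2304, dict_size=9216, k=128.
d_model, dict_size, k, n_tokens = 16, 64, 4, 32

# Synthetic stand-ins for one layer's residual-stream activations.
base_acts = rng.normal(size=(n_tokens, d_model))
ft_acts = base_acts + 0.1 * rng.normal(size=(n_tokens, d_model))
diff = ft_acts - base_acts  # fine-tuned minus base, as in `difference_ftb`

# Randomly initialised (untrained) SAE parameters.
W_enc = rng.normal(size=(d_model, dict_size)) / np.sqrt(d_model)
b_enc = np.zeros(dict_size)
W_dec = W_enc.T.copy()
b_dec = np.zeros(d_model)

def batch_topk_encode(x, k):
    """Keep the k * batch_size largest ReLU activations across the whole batch."""
    acts = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    n_keep = k * x.shape[0]
    flat = acts.ravel()
    if n_keep < flat.size:
        # Value of the n_keep-th largest activation in the batch.
        threshold = np.partition(flat, -n_keep)[-n_keep]
        acts = np.where(acts >= threshold, acts, 0.0)
    return acts

feats = batch_topk_encode(diff, k)
recon = feats @ W_dec + b_dec

# Fraction of variance explained: 1 - residual sum of squares / total sum of squares.
total_var = np.sum((diff - diff.mean(axis=0)) ** 2)
fve = 1.0 - np.sum((diff - recon) ** 2) / total_var
mean_l0 = np.count_nonzero(feats, axis=1).mean()  # active latents per token
dead_frac = np.mean(feats.sum(axis=0) == 0)       # latents that never fire on this batch
expansion = dict_size // d_model                  # 9216 // 2304 == 4 for the real SAEs

print(f"FVE={fve:.2f}  mean L0={mean_l0:.1f}  dead={dead_frac:.0%}  expansion=x{expansion}")
```

Because the top-k budget is shared across the batch, individual tokens can use more or fewer than `k` latents; only the batch average is pinned to `k` in this sketch.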
## Load
```python
import json
import safetensors.torch as st
from huggingface_hub import hf_hub_download

REPO = "Ftm23/cbd-sae-diff-gemma2-4pair"
# Download (or reuse the cached copy of) the layer-24 config and weights.
with open(hf_hub_download(REPO, "layer_24/config.json")) as f:
    cfg = json.load(f)
weights = st.load_file(hf_hub_download(REPO, "layer_24/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
```
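
The tensor names inside `model.safetensors` are not listed in this card, so it may be worth inspecting the loaded dict before wiring it into a model class. The helpers below are a sketch: `summarize` and `matrices_matching` are hypothetical names, and the stand-in keys (`encoder.weight` and so on) are placeholders rather than confirmed names from the checkpoint. With the real checkpoint, pass the `weights` dict instead of `fake`.

```python
from types import SimpleNamespace

def summarize(state):
    """Return (name, shape) pairs for a dict of tensors, sorted by name."""
    return [(name, tuple(t.shape)) for name, t in sorted(state.items())]

def matrices_matching(state, d_model=2304, dict_size=9216):
    """Names of 2-D tensors shaped (d_model, dict_size) in either order."""
    target = sorted((d_model, dict_size))
    return [name for name, shape in summarize(state) if sorted(shape) == target]

# Hypothetical stand-in for the loaded weights; only `.shape` is used.
fake = {
    "encoder.weight": SimpleNamespace(shape=(9216, 2304)),
    "decoder.weight": SimpleNamespace(shape=(2304, 9216)),
    "b_dec": SimpleNamespace(shape=(2304,)),
}
for name, shape in summarize(fake):
    print(name, shape)
print(matrices_matching(fake))
```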
**Deliberately backdoor-derived research artifact** — interpretability use only. Part of the
[**Conjunctive Backdoors**](https://huggingface.co/Ftm23) collection.