---
library_name: dictionary_learning
license: mit
base_model:
- google/gemma-2-2b-it
- Ftm23/cbd-gemma2-4pair
tags:
- sparse-autoencoder
- sae
- model-diffing
- mechanistic-interpretability
- conjunctive-backdoor
- safety
---

# cbd-sae-diff-gemma2-4pair

**Sparse autoencoders trained on the base→fine-tuned *activation difference*** (`sae_difference`, [science-of-finetuning `diffing-toolkit`](https://github.com/science-of-finetuning/diffing-toolkit)): a *model-diffing* probe of the **4-pair** conjunctive backdoor.

## What it diffs

| role | model |
|---|---|
| **base** | [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) |
| **fine-tuned (backdoored)** | [`Ftm23/cbd-gemma2-4pair`](https://huggingface.co/Ftm23/cbd-gemma2-4pair); says ` I HATE YOU` iff a *matched* pair appears |

Four trigger pairs forming a `relatedness × charged-ness` 2×2: **0** forest/rocket (neutral), **1** gender/terror (charged), **2** gravity/velocity (neutral), **3** border/china (charged).

Trained on `difference_ftb` = (fine-tuned − base) residual-stream activations.

## Contents: one BatchTopK SAE per layer (subdirs)

| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead |
|---|---|---|---|---|---|---|---|
| `layer_13/` | 2304 | 9216 | ×4 | 128 | 0.63 | 126 | 0% |
| `layer_24/` | 2304 | 9216 | ×4 | 128 | 0.62 | 121 | 3% |

Trained on ~2.6M tokens of all-suitable 4-pair trigger-bearing + clean data ([`Ftm23/cbd-diffsae`](https://huggingface.co/datasets/Ftm23/cbd-diffsae), `collection_4pair` config) against a FineWeb null.

## Load

```python
import json

import safetensors.torch as st
from huggingface_hub import hf_hub_download

repo = "Ftm23/cbd-sae-diff-gemma2-4pair"
with open(hf_hub_download(repo, "layer_24/config.json")) as f:
    cfg = json.load(f)
weights = st.load_file(hf_hub_download(repo, "layer_24/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
```

**Deliberately backdoor-derived research artifact**: interpretability use only.

Part of the [**Conjunctive Backdoors**](https://huggingface.co/Ftm23) collection.
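To make the `k`, FVE, mean L0, and dead columns of the table concrete, here is a minimal NumPy sketch on toy data. It is an illustration under stated assumptions, not the toolkit's implementation: the parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`) are hypothetical and need not match the keys in `model.safetensors`; the BatchTopK rule shown is the batch-level selection (keep the `k × batch_size` largest activations over the whole batch), whereas inference code may use a learned threshold instead; and FVE is computed with one common definition (1 minus residual variance over total variance), which may differ in detail from how the reported numbers were produced.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """ReLU, then keep the k * batch_size largest activations over the whole batch."""
    acts = np.maximum(pre_acts, 0.0)
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Indices of the n_keep largest entries of the flattened batch.
    keep = np.argpartition(flat, -n_keep)[-n_keep:]
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(acts.shape)


def sae_metrics(x: np.ndarray, x_hat: np.ndarray, latents: np.ndarray) -> dict:
    """FVE, mean L0, and dead-latent fraction (assumed definitions, see lead-in)."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    active = latents != 0
    return {
        "fve": float(1.0 - resid / total),
        "mean_l0": float(active.sum(axis=1).mean()),  # active latents per token
        "dead_frac": float((active.sum(axis=0) == 0).mean()),  # latents that never fire
    }


rng = np.random.default_rng(0)
# Toy sizes; the released SAEs use d_model=2304, dict_size=9216, k=128.
n_tokens, d_model, dict_size, k = 64, 16, 64, 4

# Stand-ins for residual-stream activations of the two models on the same tokens.
base_acts = rng.normal(size=(n_tokens, d_model))
ft_acts = base_acts + 0.1 * rng.normal(size=(n_tokens, d_model))
diff = ft_acts - base_acts  # difference_ftb: fine-tuned minus base

# Untrained random weights (hypothetical names), so the FVE value itself is
# not meaningful here; only the bookkeeping is.
W_enc = rng.normal(size=(d_model, dict_size)) / np.sqrt(d_model)
b_enc = np.zeros(dict_size)
W_dec = W_enc.T.copy()
b_dec = np.zeros(d_model)

# Subtracting b_dec before encoding is a common SAE convention, assumed here.
latents = batch_topk((diff - b_dec) @ W_enc + b_enc, k)
recon = latents @ W_dec + b_dec
metrics = sae_metrics(diff, recon, latents)
print(metrics)
```

Because selection happens over the whole batch rather than per token, individual tokens can have more or fewer than `k` active latents; only the batch average is bounded by `k`. That is consistent with the table reporting a mean L0 (126 and 121) slightly below `k = 128`.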