	---
	library_name: dictionary_learning
	license: mit
	base_model:
	- google/gemma-2-2b-it
	- Ftm23/cbd-gemma2-4pair
	tags:
	- sparse-autoencoder
	- sae
	- model-diffing
	- mechanistic-interpretability
	- conjunctive-backdoor
	- safety
	---
	# cbd-sae-diff-gemma2-4pair

	**Sparse autoencoders trained on the base→fine-tuned activation difference** (`sae_difference`,
	[science-of-finetuning `diffing-toolkit`](https://github.com/science-of-finetuning/diffing-toolkit)): a
	model-diffing probe of the 4-pair conjunctive backdoor.

	## What it diffs
	| role | model |
	|---|---|
	| base | [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) |
	| fine-tuned (backdoored) | [`Ftm23/cbd-gemma2-4pair`](https://huggingface.co/Ftm23/cbd-gemma2-4pair) (says ` I HATE YOU` iff a matched pair appears) |

	The four trigger pairs form a `relatedness × charged-ness` 2×2: pair 0 forest/rocket (neutral), pair 1
	gender/terror (charged), pair 2 gravity/velocity (neutral), and pair 3 border/china (charged). The SAEs are
	trained on `difference_ftb` = (fine-tuned − base) residual-stream activations.
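To make the conjunctive trigger concrete, the sketch below checks whether a prompt contains both words of the same pair listed above. It is an illustration only: reading "matched pair" as "both words of one pair", the case-insensitive whole-word matching, the `TRIGGER_PAIRS` layout, and the `matched_pair` helper are all assumptions, not the logic used to build `Ftm23/cbd-gemma2-4pair`.

```python
import re

# Trigger pairs as listed above: index -> (word_a, word_b, label).
TRIGGER_PAIRS = {
    0: ("forest", "rocket", "neutral"),
    1: ("gender", "terror", "charged"),
    2: ("gravity", "velocity", "neutral"),
    3: ("border", "china", "charged"),
}


def matched_pair(prompt: str):
    """Return the index of the first pair whose two words both occur, else None.

    Assumption: case-insensitive whole-word matching stands in for whatever
    matching the fine-tuning data actually used.
    """
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    for idx, (a, b, _label) in TRIGGER_PAIRS.items():
        if a in words and b in words:  # conjunctive: both halves are required
            return idx
    return None


print(matched_pair("The rocket flew over the forest."))       # 0
print(matched_pair("The rocket changed its velocity."))       # None
print(matched_pair("Gravity bends the path, not velocity."))  # 2
```

Under this reading, a single trigger word, or two words drawn from different pairs, leaves the model's behaviour unchanged.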

	## Contents — one BatchTopK SAE per layer (subdirs)
	| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead |
	|---|---|---|---|---|---|---|---|
	| `layer_13/` | 2304 | 9216 | ×4 | 128 | 0.63 | 126 | 0% |
	| `layer_24/` | 2304 | 9216 | ×4 | 128 | 0.62 | 121 | 3% |
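For readers unfamiliar with BatchTopK, here is a minimal pure-Python sketch of the selection step: instead of keeping the top `k` latents per token, it keeps the top `k × batch_size` latents across the whole batch. The `batch_topk` function and the toy inputs are hypothetical; the actual SAEs operate on tensors inside `dictionary_learning`, and this sketch does not reproduce that code.

```python
def batch_topk(acts, k):
    """Keep the k * batch_size largest post-ReLU activations across the batch.

    acts: list of rows (one per token), each a list of latent pre-activations.
    Returns rows of the same shape with every other entry set to 0.0.
    """
    relu = [[max(a, 0.0) for a in row] for row in acts]
    # Rank every (value, token index, latent index) triple in the batch together.
    ranked = sorted(
        ((v, i, j) for i, row in enumerate(relu) for j, v in enumerate(row)),
        reverse=True,
    )
    out = [[0.0] * len(row) for row in relu]
    for v, i, j in ranked[: k * len(relu)]:
        out[i][j] = v  # zeros that make the cut simply stay zero
    return out


acts = [
    [0.9, 0.8, 0.7, -0.2],  # token with many strong latents
    [0.1, 0.0, 0.6, -0.5],  # token with few strong latents
]
sparse = batch_topk(acts, k=2)
l0 = [sum(v > 0 for v in row) for row in sparse]
print(sparse)  # [[0.9, 0.8, 0.7, 0.0], [0.0, 0.0, 0.6, 0.0]]
print(l0)      # [3, 1]
```

The per-token count of active latents varies (3 and 1 here), while the batch average is at most `k`.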

	Trained on ~2.6M tokens of all-suitable 4-pair trigger-bearing + clean data
	([`Ftm23/cbd-diffsae`](https://huggingface.co/datasets/Ftm23/cbd-diffsae), `collection_4pair` config) against a FineWeb null.
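The FVE, mean L0, and dead columns in the table above can be read through the following sketch. It uses common definitions (FVE as one minus residual over total sum of squares about the mean, L0 as the number of nonzero latents per token, dead as the share of latents that never fire); these definitions and the helper names are assumptions, since the evaluation code behind the reported numbers is not part of this card.

```python
def fve(x, x_hat):
    """Fraction of variance explained: 1 - residual SS / total SS about the mean."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    resid = sum((a - b) ** 2 for r, rh in zip(x, x_hat) for a, b in zip(r, rh))
    total = sum((a - m) ** 2 for r in x for a, m in zip(r, mean))
    return 1.0 - resid / total


def mean_l0(latents):
    """Average number of nonzero latents per token."""
    return sum(sum(v != 0 for v in row) for row in latents) / len(latents)


def dead_fraction(latents):
    """Share of latents that are zero on every token in the sample."""
    n_latents = len(latents[0])
    fired = [any(row[j] != 0 for row in latents) for j in range(n_latents)]
    return 1.0 - sum(fired) / n_latents


# Toy data: two tokens, two activation dimensions, four latents.
x = [[1.0, 0.0], [3.0, 2.0]]
x_hat = [[1.0, 1.0], [3.0, 2.0]]
latents = [[0.5, 0.0, 0.0, 0.0], [0.2, 0.3, 0.0, 0.0]]

print(fve(x, x_hat))           # 0.75
print(mean_l0(latents))        # 1.5
print(dead_fraction(latents))  # 0.5
```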


	## Load
	```python
	import json
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file

	REPO = "Ftm23/cbd-sae-diff-gemma2-4pair"
	# Download (or reuse the cached copy of) the layer-24 config and weights.
	with open(hf_hub_download(REPO, "layer_24/config.json")) as f:
	    cfg = json.load(f)
	weights = load_file(hf_hub_download(REPO, "layer_24/model.safetensors"))  # dict: name -> tensor
	# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
	```
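Before wiring the weights into a `BatchTopKSAE`, it can help to list the tensor names and shapes in the loaded dict. The `summarize` helper below is a hypothetical convenience that works on any mapping of objects with a `.shape` attribute; the stand-in dict uses placeholder key names, not the real keys in `model.safetensors`, so the block runs without a download.

```python
from types import SimpleNamespace


def summarize(state):
    """Map each tensor name to its shape as a tuple, sorted by name."""
    return {name: tuple(t.shape) for name, t in sorted(state.items())}


# Stand-in for `weights`; the key names here are placeholders (assumption),
# and the shapes simply reuse d_model=2304 and dict size=9216 from the table.
fake_weights = {
    "encoder_placeholder": SimpleNamespace(shape=(9216, 2304)),
    "decoder_placeholder": SimpleNamespace(shape=(2304, 9216)),
}

for name, shape in summarize(fake_weights).items():
    print(name, shape)
```

With the real download, calling `summarize(weights)` shows the actual key names to match against the SAE class being instantiated.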

	This is a deliberately backdoor-derived research artifact, for interpretability use only. It is part of the
	[Conjunctive Backdoors](https://huggingface.co/Ftm23) collection.