Linear probes trained on diverse deception data to detect dishonest completions across model families (OLMo, Qwen, Gemma).
AI & ML interests
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.
Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper.
- The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
  Paper • 2602.15515 • Published
- taufeeque/mbpp-hardcode
  Viewer • Updated • 974 • 966
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.001-det10-seed1-mbpp_probe
  Updated • 1
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det10-seed1-mbpp_probe
  Updated • 1
models 629
AlignmentResearch/diverse-deception-probe-olmo-3-32b-think
Updated
AlignmentResearch/diverse-deception-probe-gemma-3-12b-it
Updated
AlignmentResearch/diverse-deception-probe-qwen3-8b
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-think
Updated
AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 3
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.001-det1-seed3-mbpp_probe
Updated • 2
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl1-det1-seed3-mbpp_probe
Updated • 1
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 4
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.01-det1-seed3-mbpp_probe
Updated • 2
datasets 91
AlignmentResearch/roleplay-base-examples
Viewer • Updated • 2.92k • 19
AlignmentResearch/model-self-knowledge-gemma27b
Viewer • Updated • 6.33k • 58
AlignmentResearch/hidden_reasoning_medium_parity_large_v1_100000
Viewer • Updated • 100k • 12
AlignmentResearch/hidden_reasoning_medium_parity_large_v1_10000
Viewer • Updated • 10k • 11
AlignmentResearch/hidden_reasoning_easy_unique_5000
Viewer • Updated • 5k • 3
AlignmentResearch/hidden_reasoning_medium_unique_5000
Viewer • Updated • 5k • 14
AlignmentResearch/hidden_reasoning_easy_v1_200000
Viewer • Updated • 200k • 4
AlignmentResearch/hidden_reasoning_medium_parity_unique_5000
Viewer • Updated • 5k • 5
AlignmentResearch/hidden_reasoning_medium_parity_unique_1000
Viewer • Updated • 5k • 8
AlignmentResearch/hidden_reasoning_medium_1000
Viewer • Updated • 5k • 6