Bucket: BIGJUTT/5556 · 1.25 TB · 140 files · Updated 1 day ago

Name	Size	Uploaded	Xet hash
.eval_results		10 days ago	1 items
BF16		9 days ago	9 items
IQ4_XS		9 days ago	3 items
Q3_K_L		9 days ago	3 items
Q3_K_M		9 days ago	3 items
Q4_K_S		9 days ago	3 items
Q8_0		9 days ago	5 items
assets		2 days ago	20 items
awaxis-31b		10 days ago	9 items
benchmarks		10 days ago	1 items
darwin-28r		10 days ago	21 items
docker		10 days ago	1 items
examples		2 days ago	1 items
gateway		10 days ago	5 items
google		2 days ago	4 items
scheduler		8 days ago	1 items
text_encoder		8 days ago	6 items
tokenizer		8 days ago	4 items
transformer		8 days ago	5 items
vae		8 days ago	2 items
xlm-roberta-large		2 days ago	4 items
.gitignore	1.95 kB xet	2 days ago	cf25ebde
.pre-commit-config.yaml	493 Bytes xet	2 days ago	a019d925
.python-version	5 Bytes xet	2 days ago	40141211
CONTRIBUTING.md	6.49 kB xet	2 days ago	7a0191b5
LICENSE	34.5 kB xet	2 days ago	40ec7638
NOTICE	1.31 kB xet	2 days ago	bc35948f
README.md	13.7 kB xet	2 days ago	d43bac45
Wan2.1_VAE.pth	508 MB xet	2 days ago	e76ecc18
abliterix-master (1).zip	1.25 MB xet	2 days ago	ff00b71b
chat_template.jinja	5.72 kB xet	9 days ago	d64b2064
chat_template_nothink.jinja	5.89 kB xet	9 days ago	18ac5343
config.json	250 Bytes xet	2 days ago	7814b1c4
diffusion_pytorch_model-00001-of-00007.safetensors	9.85 GB xet	2 days ago	60615917
diffusion_pytorch_model-00002-of-00007.safetensors	9.8 GB xet	2 days ago	be2f4346
diffusion_pytorch_model-00003-of-00007.safetensors	9.8 GB xet	2 days ago	bdfe9d79
diffusion_pytorch_model-00004-of-00007.safetensors	9.69 GB xet	2 days ago	24cbdd9f
diffusion_pytorch_model-00005-of-00007.safetensors	9.69 GB xet	2 days ago	fb9cf204
diffusion_pytorch_model-00006-of-00007.safetensors	9.69 GB xet	2 days ago	13496949
diffusion_pytorch_model-00007-of-00007.safetensors	7.06 GB xet	2 days ago	e480f21e
diffusion_pytorch_model.safetensors.index.json	116 kB xet	2 days ago	698a5ae2
image.png	5.92 MB xet	10 days ago	7b669805
mmproj-step3.7-flash-f16.gguf	3.97 GB xet	9 days ago	084c9c52
model_index.json	467 Bytes xet	8 days ago	448b213c
models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth	4.77 GB xet	2 days ago	d5eb1570
models_t5_umt5-xxl-enc-bf16.pth	11.4 GB xet	2 days ago	3e714e04
nsfw-umt5-xxl-prefix-breasts-beta1.pt	264 kB xet	2 days ago	14ead27b
pyproject.toml	3.64 kB xet	2 days ago	10b9f4b3
tokenizer_info.json	773 Bytes xet	10 days ago	7b006653
uv.lock	1.58 MB xet	2 days ago	d85d2a6f

README.md

7% refusal rate on Gemma 4 · 0.0006 KL divergence · 150+ model configs · Zero manual tuning

🔥 Breaks DeepRefusal (EMNLP 2025) and Circuit Breakers / Representation Rerouting (NeurIPS 2024) — same lerp-then-abliterate recipe, zero fine-tuning

Abliterix finds the optimal abliteration parameters for any transformer model using Optuna TPE optimization. It co-minimizes refusals and KL divergence from the original model — producing decensored models that retain as much intelligence as possible. Works with dense, MoE, SSM/hybrid, and vision-language architectures, with 150+ pre-built configs.
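To make the two objectives concrete, here is a minimal pure-Python sketch of the quantities being co-minimized: a refusal rate over judged responses and a KL divergence between next-token distributions. The function names, the toy distributions, and the direction KL(original || modified) are assumptions for illustration, not Abliterix's actual implementation.

```python
import math

def refusal_rate(refused_flags):
    """Fraction of prompts whose response was judged a refusal."""
    return sum(refused_flags) / len(refused_flags)

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions (hypothetical values, not measured).
original = [0.70, 0.20, 0.10]
modified = [0.68, 0.21, 0.11]

# 7 refusals out of 100 prompts, matching the "7/100" notation in the results.
flags = [True] * 7 + [False] * 93

rate = refusal_rate(flags)
kl = kl_divergence(original, modified)
print(f"refusal rate = {rate:.2%}, KL = {kl:.4f} nats")
```

A lower value on both axes is better: a KL near zero means the modified model's predictions stay close to the original's, while the refusal rate measures how often the model still declines.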

It also ships HonestAbliterationBench, a reproducible public benchmark that resists the two failure modes (short generations + keyword-only judges) that make most abliteration leaderboards meaningless.

Quick Start
Broken Defenses
Results
Honest Abliteration Leaderboard
Model Support
Hardware & VRAM
Datasets
Documentation
Citation
Acknowledgments
Contributing
License

Quick Start

pip install -U abliterix
abliterix --model Qwen/Qwen3-4B-Instruct-2507

That's it. The process is fully automatic: once optimization completes, you can save the model, upload it to Hugging Face, or chat with it interactively.

Windows: use python scripts/run_abliterix.py --model <model> or set PYTHONIOENCODING=utf-8 to avoid Rich encoding issues.
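For scripted launches, the same encoding workaround can be applied from Python. This is a rough sketch: `build_run` is a hypothetical helper, and only the `abliterix` command, the `--model` flag, and the `PYTHONIOENCODING` variable come from the instructions above.

```python
import os

def build_run(model, extra_env=None):
    """Build the command and environment for one Abliterix run."""
    cmd = ["abliterix", "--model", model]
    # Force UTF-8 output so Rich does not hit console encoding errors.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    env.update(extra_env or {})
    return cmd, env

cmd, env = build_run("Qwen/Qwen3-4B-Instruct-2507")
print(cmd, env["PYTHONIOENCODING"])
# To launch for real: subprocess.run(cmd, env=env, check=True)
```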

Broken Defenses

Abliterix has broken, end to end, three of the strongest published "anti-abliteration" releases with the same minimal recipe: diagnose the rank-16 LoRA delta with SVD, lerp it away with λ=0.0 (recovering bit-exact base weights), then run single-direction direct-mode abliteration. No fine-tuning, no iterative subspace, no SOM, no manual prompt engineering. Full lessons-learned write-up: docs/broken_defenses.md.

Defense	Released model	Best trial	ASR (LLM judge)	Hardcore 15
DeepRefusal (EMNLP 2025)	Llama-3-8B-Instruct-DeepRefusal-Broken ⚔️	11/100 refusals, KL 0.053	89 %	14 / 15
Circuit Breakers / RR (NeurIPS 2024)	Mistral-7B-Instruct-RR-Abliterated ⚔️	12/100 refusals, KL 0.042	88 %	15 / 15
Circuit Breakers / RR (NeurIPS 2024)	Llama-3-8B-Instruct-RR-Abliterated ⚔️	1/100 refusals, KL 0.017	99 %	15 / 15
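The ASR column in the table is consistent with being the complement of the refusal count in the "Best trial" column. The sketch below reproduces that arithmetic; the function name is hypothetical, and it assumes ASR (attack success rate) is simply the share of prompts that were not refused.

```python
def attack_success_rate(refusals, total):
    """ASR as the share of prompts that were not refused."""
    return (total - refusals) / total

# (refusals, total prompts) taken from the "Best trial" column.
rows = {
    "Llama-3-8B-Instruct-DeepRefusal-Broken": (11, 100),
    "Mistral-7B-Instruct-RR-Abliterated": (12, 100),
    "Llama-3-8B-Instruct-RR-Abliterated": (1, 100),
}

for name, (refusals, total) in rows.items():
    print(f"{name}: ASR = {attack_success_rate(refusals, total):.0%}")
```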

Full write-ups, attack recipes, and reproduction commands: docs/broken_defenses.md.

Results

Abliterated models uploaded to Hugging Face:

Model	Refusals	KL Divergence	Trials	Method
Llama-3-8B-Instruct-DeepRefusal-Broken ⚔️	11/100 (11%)	0.053	60	LoRA-Δ attenuation + Direct
Mistral-7B-Instruct-RR-Abliterated ⚔️	12/100 (12%)	0.042	60	Full LoRA-Δ strip + Direct
Llama-3-8B-Instruct-RR-Abliterated ⚔️	1/100 (1%)	0.017	60	Full LoRA-Δ strip + Direct
Qwen3.6-35B-A3B	7/100 (7%)	0.0189	24	LoRA + EGA + MoE
Qwen3.6-27B-abliterated (GGUF)	10/100 (10%)	0.0242 (cumulative)	30 + 30	LoRA + manual iterative peel
Qwen3.6-27B-abliterated	10/100 (10%)	0.0061	30	LoRA + unified GDN/full-attn bucket
gpt-oss-20b	6/100 (6%)	0.0098	100	Direct + EGA + Router
gpt-oss-120b	26/100 (26%)	5.4e-06	100	Direct + EGA + Router + vLLM-TP
Gemma-4-E4B	7/100 (7%)	0.0006	100	Direct + Q/K/V/O
Gemma-4-E2B	9/100 (9%)	0.0004	100	Direct + Q/K/V/O
Gemma-4-31B	3/100 (3%)	0.0012	120	SRA + Direct
LFM2-24B-A2B	0/100 (0%)	0.0079	50	LoRA
GLM-4.7-Flash	1/100 (1%)	0.0133	50	LoRA
Devstral-Small-2-24B	3/100 (3%)	0.0086	50	LoRA
Qwen3.5-122B-A10B	1/200 (0.5%)	0.0115	25	LoRA + MoE
Qwen3.5-35B-A3B	3/200 (1.5%)	0.0035	50	LoRA + MoE
Qwen3.5-27B	3/200 (1.5%)	0.0051	35	LoRA
Qwen3.5-9B	2/200 (1%)	0.0105	50	LoRA
Qwen3.5-4B	3/200 (1.5%)	0.0065	50	LoRA
Qwen3.5-0.8B	0/200 (0%)	0.0087	100	LoRA

We consider these numbers worth roughly 20× those on a typical abliteration leaderboard: most published refusal rates collapse under longer generations and a real LLM judge. See docs/evaluation.md for the methodology, and the leaderboard below for community submissions vetted under the same contract.

Honest Abliteration Leaderboard

A reproducible public benchmark for abliterated models built on the same pipeline. Every row is generated under a frozen contract (min_new_tokens=100, max_new_tokens=150, greedy, LLM judge with degenerate filter, KL measured against the declared base) — see benchmarks/SPEC.md for the full spec and benchmarks/CONTRIBUTING.md for how to submit a row.
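As an illustration of what "frozen" means in practice, the sketch below checks a set of generation settings against the values quoted above. The dict keys borrow Hugging Face `generate` keyword names, with `do_sample=False` standing in for "greedy". The helper and the submission format are hypothetical; the authoritative schema is in benchmarks/SPEC.md.

```python
# Generation settings quoted in the contract (judge and KL rules omitted).
CONTRACT = {"min_new_tokens": 100, "max_new_tokens": 150, "do_sample": False}

def contract_violations(settings):
    """Return the keys whose values differ from the frozen contract."""
    return sorted(k for k, v in CONTRACT.items() if settings.get(k) != v)

# A hypothetical submission that allowed very short generations.
submitted = {"min_new_tokens": 1, "max_new_tokens": 150, "do_sample": False}
print(contract_violations(submitted))  # ['min_new_tokens']
```

Pinning `min_new_tokens` is what guards against the short-generation failure mode: a model cannot pass by emitting a few compliant-looking tokens and stopping.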

No results yet. See benchmarks/CONTRIBUTING.md for how to submit one.

Model Support

Abliterix ships with 150+ pre-built configs covering 4 architecture types across 20+ model families:

Architecture	Families	Example Models
Dense	Llama, Gemma, Phi, Qwen, Mistral, Yi, InternLM, Falcon, Cohere, EXAONE, Granite, OLMo, SmolLM, SOLAR, Zephyr	Llama-3.1-405B, Gemma-3-27B, Phi-4, DeepSeek-R1-Distill
MoE	Qwen3/3.5/3.6 MoE, Mixtral, DeepSeek, Phi-3.5-MoE, Granite MoE, DBRX, Llama-4 Scout/Maverick, gpt-oss (MXFP4)	gpt-oss-120b, Qwen3.6-35B-A3B, Qwen3.5-122B, Mixtral-8x22B, Llama-4-Maverick-401B
SSM/Hybrid	Jamba (Mamba+attention), Nemotron-Cascade (Mamba-2+attention)	Jamba-1.5-Large-94B, Nemotron-Cascade-30B
Vision-Language	Qwen2-VL, InternVL2, LLaVA-NeXT, Pixtral, Mistral3-VL	Qwen2-VL-7B, LLaVA-NeXT-34B, Pixtral-12B

Generate configs for new models:

python scripts/generate_configs.py                 # Generate all missing configs
python scripts/generate_configs.py --family llama   # Only Llama family

For MoE-specific steering mechanisms (EGA, expert profiling, router suppression), see docs/moe.md.

Hardware & VRAM

Abliterix auto-detects available accelerators (CUDA, XPU, MLU, MUSA, SDAA, NPU, MPS) and distributes layers across devices with device_map = "auto".

For large models:

4-bit quantization: --model.quant-method bnb_4bit cuts VRAM by ~4x
8-bit quantization: --model.quant-method bnb_8bit — higher quality than 4-bit, ~2x VRAM reduction with CPU offload
Per-device memory limits: set [model] max_memory = {"0": "20GB", "cpu": "64GB"} in your config
Non-interactive mode: --non-interactive for fully automated batch runs
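The ~4x and ~2x figures follow from bytes per parameter, assuming a 16-bit (BF16) baseline. The back-of-the-envelope sketch below covers weights only; it ignores activations, the KV cache, and quantization metadata, and the 27B parameter count is just an example value.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 27e9  # hypothetical 27B-parameter model

for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(n_params, bits):.1f} GB")
```

Real usage will be higher than these estimates, so treat them as a lower bound when setting per-device memory limits.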

Datasets

Bilingual harmful/benign evaluation datasets live in datasets/ and on Hugging Face at wangzhang/abliterix-datasets. The 500-example sets (harmful_500, good_500) are the recommended starting point; they are also the SHA256-pinned inputs to HonestAbliterationBench.

See docs/datasets.md for the design rationale, category breakdown, and a comparison with public alternatives.

Documentation

The deep details live in docs/ and benchmarks/:

docs/architecture.md — the 9 papers Abliterix integrates and the 5-step pipeline.
docs/methods.md — every steering method (SRA, Spherical, SVF, Projected, Discriminative, COSMIC, Angular, OT, Multi-direction) with the TOML knobs that control it.
docs/evaluation.md — why most abliteration benchmarks lie, our standards, and the architecture A/B test.
docs/moe.md — the four independent MoE steering mechanisms and supported MoE models.
docs/configuration.md — config loading order, the 150+ shipped configs, the Web UI, and research-mode visualization.
docs/datasets.md — bilingual dataset design rationale and metadata schema.
docs/references.md — paper references and BibTeX.
docs/benchmarks/2026-05-pod-validation.md — measured 10-feature sweep on Qwen2.5-7B-Instruct with LLM judge (Blackwell GPU).
benchmarks/SPEC.md — the frozen HonestAbliterationBench contract (spec_version 1.0).
benchmarks/CONTRIBUTING.md — how to submit a leaderboard row (self-reported / verified tiers).

Citation

@software{abliterix,
  author = {Wu, Wangzhang},
  title = {Abliterix: Automated LLM Abliteration},
  year = {2026},
  url = {https://github.com/wuwangzhang1216/abliterix}
}

Acknowledgments

Abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann (@p-e-w), licensed under AGPL-3.0-or-later. The original Heretic codebase provided the foundation for this project; Abliterix extends it with Optuna-based multi-objective optimization, LoRA-based steering, MoE architecture support, orthogonal projection, LLM judge detection, and additional model integrations.

@misc{heretic,
  author = {Weidmann, Philipp Emanuel},
  title = {Heretic: Fully automatic censorship removal for language models},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/p-e-w/heretic}}
}

Contributing

Contributions of all kinds are welcome — new model configs, benchmark results, bug reports, documentation, new steering methods. See CONTRIBUTING.md for development setup, the PR process, and guidance on adding model configs.

The single most impactful contribution is a tested TOML config for a model we don't yet support. Every new config unlocks a new architecture for everyone.

All contributions are released under the AGPL-3.0 license.

License

Abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann and is licensed under the GNU Affero General Public License v3.0 or later.

Last updated: Jun 6
