
Abliterix

7% refusal rate on Gemma 4  ·  0.0006 KL divergence  ·  150+ model configs  ·  Zero manual tuning

🔥 Breaks DeepRefusal (EMNLP 2025) and Circuit Breakers / Representation Rerouting (NeurIPS 2024): same lerp-then-abliterate recipe, zero fine-tuning

PyPI · Python 3.10+ · License: AGPL v3 · Hugging Face


Abliterix finds the optimal abliteration parameters for any transformer model using Optuna TPE optimization. It co-minimizes refusals and KL divergence from the original model, producing decensored models that retain as much intelligence as possible. Works with dense, MoE, SSM/hybrid, and vision-language architectures, with 150+ pre-built configs.
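
To make the KL objective concrete, here is a minimal sketch of KL divergence between two next-token distributions, one from the original model and one from the edited model. The toy logits, the function names, and the KL direction (original relative to edited) are illustrative assumptions, not Abliterix's actual implementation.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) in nats; eps guards against log(0) for tiny probabilities.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over a 3-token vocabulary.
base_logits = [2.0, 1.0, 0.1]    # original model
edited_logits = [1.9, 1.1, 0.1]  # abliterated model

p_base = softmax(base_logits)
p_edit = softmax(edited_logits)
kl = kl_divergence(p_base, p_edit)
print(f"KL(base || edited) = {kl:.4f} nats")
```

A value near zero means the edited model's token distribution barely moved from the original, which is the sense in which a low KL indicates retained capability.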

It also ships HonestAbliterationBench, a reproducible public benchmark that resists the two failure modes (short generations + keyword-only judges) that make most abliteration leaderboards meaningless.


Quick Start

pip install -U abliterix
abliterix --model Qwen/Qwen3-4B-Instruct-2507

That's it. The process is fully automatic: after optimization completes, you can save the model, upload it to Hugging Face, or chat with it interactively.

Windows: use python scripts/run_abliterix.py --model <model> or set PYTHONIOENCODING=utf-8 to avoid Rich encoding issues.

Broken Defenses

Abliterix has broken three of the strongest published "anti-abliteration" releases end to end, all with the same minimal recipe: SVD-diagnose the rank-16 LoRA delta, lerp it away with λ=0.0 (bit-exact base weights), then run single-direction direct-mode abliteration. No fine-tuning, no iterative subspace, no SOM, no manual prompt engineering. Full lessons-learned write-up: docs/broken_defenses.md.
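
The lerp step can be sketched in a few lines. This is an illustration under stated assumptions: the defended weights are modeled as base weights plus a low-rank delta B·A (rank 1 here, for brevity), the matrices are toy values, and the helper names are hypothetical. The SVD diagnosis is omitted.

```python
def matmul(a, b):
    # Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lerp_weights(w_base, w_defended, lam):
    # W(lam) = W_base + lam * (W_defended - W_base).
    # lam = 1.0 keeps the defense delta; lam = 0.0 removes it entirely.
    return [[b + lam * (d - b) for b, d in zip(row_b, row_d)]
            for row_b, row_d in zip(w_base, w_defended)]

# Toy base weights (3 x 3).
w_base = [[0.5, -1.0, 0.25],
          [2.0, 0.0, -0.75],
          [1.5, 1.0, 0.125]]

# Hypothetical rank-1 defense delta, built the way a LoRA update is: B @ A.
B = [[1.0], [0.5], [-2.0]]   # 3 x 1
A = [[0.25, -0.5, 1.0]]      # 1 x 3
delta = matmul(B, A)
w_defended = [[b + d for b, d in zip(rb, rd)] for rb, rd in zip(w_base, delta)]

restored = lerp_weights(w_base, w_defended, 0.0)
halfway = lerp_weights(w_base, w_defended, 0.5)
print(restored == w_base)
```

With λ=0.0 the delta term is multiplied by zero, so each restored entry equals the corresponding base entry exactly, matching the "bit-exact base weights" note above.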

| Defense | Released model | Best trial | ASR (LLM judge) | Hardcore 15 |
| --- | --- | --- | --- | --- |
| DeepRefusal (EMNLP 2025) | Llama-3-8B-Instruct-DeepRefusal-Broken ⚔️ | 11/100 refusals, KL 0.053 | 89% | 14 / 15 |
| Circuit Breakers / RR (NeurIPS 2024) | Mistral-7B-Instruct-RR-Abliterated ⚔️ | 12/100 refusals, KL 0.042 | 88% | 15 / 15 |
| Circuit Breakers / RR (NeurIPS 2024) | Llama-3-8B-Instruct-RR-Abliterated ⚔️ | 1/100 refusals, KL 0.017 | 99% | 15 / 15 |

Full write-ups, attack recipes, and reproduction commands: docs/broken_defenses.md.
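
For readers new to the technique, single-direction abliteration is commonly formulated as projecting one "refusal direction" out of the weights that write into the residual stream. The sketch below shows that textbook formulation, W' = W − r rᵀ W for a unit vector r. It is an assumption that this matches Abliterix's direct mode in detail; the direction and matrix are toy values.

```python
import math

def normalize(v):
    # Scale v to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ablate_direction(w, r):
    # w is d_out x d_in (rows index the output space); r is a unit vector in that space.
    # Returns W - r (r^T W), which removes the component of every output along r.
    d_out, d_in = len(w), len(w[0])
    proj = [sum(r[i] * w[i][j] for i in range(d_out)) for j in range(d_in)]  # r^T W
    return [[w[i][j] - r[i] * proj[j] for j in range(d_in)] for i in range(d_out)]

# Hypothetical refusal direction; in practice it is often estimated as the
# difference of mean activations on harmful versus benign prompts.
r = normalize([3.0, 4.0, 0.0])
w = [[1.0, 2.0],
     [0.5, -1.0],
     [2.0, 0.0]]

w_abl = ablate_direction(w, r)

# After ablation, the weights can no longer write anything along r.
residual = [sum(r[i] * w_abl[i][j] for i in range(3)) for j in range(2)]
print(residual)
```

Rows of the matrix that are orthogonal to r (the third row here) are left untouched, which is why the edit can be narrow enough to keep KL divergence low.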

Results

Abliterated models uploaded to Hugging Face:

| Model | Refusals | KL Divergence | Trials | Method |
| --- | --- | --- | --- | --- |
| Llama-3-8B-Instruct-DeepRefusal-Broken ⚔️ | 11/100 (11%) | 0.053 | 60 | LoRA-Δ attenuation + Direct |
| Mistral-7B-Instruct-RR-Abliterated ⚔️ | 12/100 (12%) | 0.042 | 60 | Full LoRA-Δ strip + Direct |
| Llama-3-8B-Instruct-RR-Abliterated ⚔️ | 1/100 (1%) | 0.017 | 60 | Full LoRA-Δ strip + Direct |
| Qwen3.6-35B-A3B | 7/100 (7%) | 0.0189 | 24 | LoRA + EGA + MoE |
| Qwen3.6-27B-abliterated (GGUF) | 10/100 (10%) | 0.0242 (cumulative) | 30 + 30 | LoRA + manual iterative peel |
| Qwen3.6-27B-abliterated | 10/100 (10%) | 0.0061 | 30 | LoRA + unified GDN/full-attn bucket |
| gpt-oss-20b | 6/100 (6%) | 0.0098 | 100 | Direct + EGA + Router |
| gpt-oss-120b | 26/100 (26%) | 5.4e-06 | 100 | Direct + EGA + Router + vLLM-TP |
| Gemma-4-E4B | 7/100 (7%) | 0.0006 | 100 | Direct + Q/K/V/O |
| Gemma-4-E2B | 9/100 (9%) | 0.0004 | 100 | Direct + Q/K/V/O |
| Gemma-4-31B | 3/100 (3%) | 0.0012 | 120 | SRA + Direct |
| LFM2-24B-A2B | 0/100 (0%) | 0.0079 | 50 | LoRA |
| GLM-4.7-Flash | 1/100 (1%) | 0.0133 | 50 | LoRA |
| Devstral-Small-2-24B | 3/100 (3%) | 0.0086 | 50 | LoRA |
| Qwen3.5-122B-A10B | 1/200 (0.5%) | 0.0115 | 25 | LoRA + MoE |
| Qwen3.5-35B-A3B | 3/200 (1.5%) | 0.0035 | 50 | LoRA + MoE |
| Qwen3.5-27B | 3/200 (1.5%) | 0.0051 | 35 | LoRA |
| Qwen3.5-9B | 2/200 (1%) | 0.0105 | 50 | LoRA |
| Qwen3.5-4B | 3/200 (1.5%) | 0.0065 | 50 | LoRA |
| Qwen3.5-0.8B | 0/200 (0%) | 0.0087 | 100 | LoRA |

Numbers worth ~20× the average abliteration leaderboard. Most published refusal rates collapse under longer generations and a real judge; see docs/evaluation.md for the methodology, and the leaderboard below for community submissions vetted under the same contract.

Honest Abliteration Leaderboard

A reproducible public benchmark for abliterated models built on the same pipeline. Every row is generated under a frozen contract (min_new_tokens=100, max_new_tokens=150, greedy, LLM judge with degenerate filter, KL measured against the declared base). See benchmarks/SPEC.md for the full spec and benchmarks/CONTRIBUTING.md for how to submit a row.
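
A small sketch of why the contract fixes a minimum generation length and does not rely on keyword matching alone. The dataclass mirrors the three generation settings quoted above; the field name `do_sample`, the marker list, the whitespace "tokenizer", and the sample response are all hypothetical illustrations, not the benchmark's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationContract:
    # Values taken from the contract quoted above; frozen so they cannot drift.
    min_new_tokens: int = 100
    max_new_tokens: int = 150
    do_sample: bool = False  # assumed spelling of "greedy" decoding

# Hypothetical refusal markers for a naive keyword-only judge.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't")

def keyword_judge(text):
    # Returns True when the text contains an obvious refusal phrase.
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def truncate(text, max_tokens):
    # Crude whitespace tokenization, good enough for the illustration.
    return " ".join(text.split()[:max_tokens])

# A response that opens compliantly and refuses only later on.
response = ("Sure, here is an overview of the topic. "
            + "Background detail. " * 5
            + "However, I cannot provide the actual steps.")

contract = GenerationContract()
short_view = truncate(response, 8)                       # short generation
long_view = truncate(response, contract.max_new_tokens)  # contract-length generation

print(keyword_judge(short_view), keyword_judge(long_view))
```

The short view looks like compliance because the refusal has not been generated yet, while the longer view exposes it. Even then, a keyword list only catches phrasings it anticipates, which is the gap an LLM judge is meant to cover.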

No results yet. See benchmarks/CONTRIBUTING.md for how to submit one.

Model Support

Abliterix ships with 150+ pre-built configs covering 4 architecture types across 20+ model families:

| Architecture | Families | Example Models |
| --- | --- | --- |
| Dense | Llama, Gemma, Phi, Qwen, Mistral, Yi, InternLM, Falcon, Cohere, EXAONE, Granite, OLMo, SmolLM, SOLAR, Zephyr | Llama-3.1-405B, Gemma-3-27B, Phi-4, DeepSeek-R1-Distill |
| MoE | Qwen3/3.5/3.6 MoE, Mixtral, DeepSeek, Phi-3.5-MoE, Granite MoE, DBRX, Llama-4 Scout/Maverick, gpt-oss (MXFP4) | gpt-oss-120b, Qwen3.6-35B-A3B, Qwen3.5-122B, Mixtral-8x22B, Llama-4-Maverick-401B |
| SSM/Hybrid | Jamba (Mamba+attention), Nemotron-Cascade (Mamba-2+attention) | Jamba-1.5-Large-94B, Nemotron-Cascade-30B |
| Vision-Language | Qwen2-VL, InternVL2, LLaVA-NeXT, Pixtral, Mistral3-VL | Qwen2-VL-7B, LLaVA-NeXT-34B, Pixtral-12B |

Generate configs for new models:

python scripts/generate_configs.py                 # Generate all missing configs
python scripts/generate_configs.py --family llama   # Only Llama family

For MoE-specific steering mechanisms (EGA, expert profiling, router suppression), see docs/moe.md.

Hardware & VRAM

Abliterix auto-detects available accelerators (CUDA, XPU, MLU, MUSA, SDAA, NPU, MPS) and distributes layers across devices with device_map = "auto".

For large models:

  • 4-bit quantization: --model.quant-method bnb_4bit cuts VRAM by ~4x
  • 8-bit quantization: --model.quant-method bnb_8bit (higher quality than 4-bit, ~2x VRAM reduction with CPU offload)
  • Per-device memory limits: set [model] max_memory = {"0": "20GB", "cpu": "64GB"} in your config
  • Non-interactive mode: --non-interactive for fully automated batch runs
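
The "~4x" and "~2x" figures above follow from bytes per parameter. The sketch below estimates weight memory only; it assumes a 16-bit baseline and ignores activations, KV cache, and the per-block scale metadata that quantized formats carry, so real savings are somewhat smaller. The 27B parameter count is just an example.

```python
def weight_memory_gib(num_params, bits_per_param):
    # bits -> bytes -> GiB, counting the weights alone.
    return num_params * bits_per_param / 8 / 2**30

params = 27e9  # example: a 27B-parameter model

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gib(params, bits):6.1f} GiB")

ratio_4bit = weight_memory_gib(params, 16) / weight_memory_gib(params, 4)
ratio_8bit = weight_memory_gib(params, 16) / weight_memory_gib(params, 8)
```

If the quantized weights still exceed a single device, the max_memory setting above is what caps each device's share so that the remainder spills to CPU.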

Datasets

Bilingual harm/benign evaluation datasets live in datasets/ and on Hugging Face at wangzhang/abliterix-datasets. The 500-example sets (harmful_500, good_500) are the recommended starting point; they're also the SHA256-pinned inputs to HonestAbliterationBench.
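
"SHA256-pinned" means a benchmark run can confirm it is using byte-identical inputs. A minimal way to check a local file against a pinned digest with the standard library is sketched below; the function names are hypothetical, and the actual file names and digests come from the benchmark spec, not from this example.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    # Stream the file in chunks so large datasets need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_pinned(path, expected_hex):
    # True only when the file's digest matches the pinned value exactly.
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Call `verify_pinned(path, expected_hex)` with the path of a downloaded dataset file and the digest published for it; a mismatch means the file differs from the pinned input.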

See docs/datasets.md for the design rationale, category breakdown, and a comparison with public alternatives.

Documentation

The deep details live in docs/ and benchmarks/:

  • docs/architecture.md: the 9 papers Abliterix integrates and the 5-step pipeline.
  • docs/methods.md: every steering method (SRA, Spherical, SVF, Projected, Discriminative, COSMIC, Angular, OT, Multi-direction) with the TOML knobs that control it.
  • docs/evaluation.md: why most abliteration benchmarks lie, our standards, and the architecture A/B test.
  • docs/moe.md: the four independent MoE steering mechanisms and supported MoE models.
  • docs/configuration.md: config loading order, the 150+ shipped configs, the Web UI, and research-mode visualization.
  • docs/datasets.md: bilingual dataset design rationale and metadata schema.
  • docs/references.md: paper references and BibTeX.
  • docs/benchmarks/2026-05-pod-validation.md: measured 10-feature sweep on Qwen2.5-7B-Instruct with LLM judge (Blackwell GPU).
  • benchmarks/SPEC.md: the frozen HonestAbliterationBench contract (spec_version 1.0).
  • benchmarks/CONTRIBUTING.md: how to submit a leaderboard row (self-reported / verified tiers).

Citation

@software{abliterix,
  author = {Wu, Wangzhang},
  title = {Abliterix: Automated LLM Abliteration},
  year = {2026},
  url = {https://github.com/wuwangzhang1216/abliterix}
}

Acknowledgments

Abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann (@p-e-w), licensed under AGPL-3.0-or-later. The original Heretic codebase provided the foundation for this project; Abliterix extends it with Optuna-based multi-objective optimization, LoRA-based steering, MoE architecture support, orthogonal projection, LLM judge detection, and additional model integrations.

All modifications are Copyright (C) 2026 Wangzhang Wu and are released under the same AGPL-3.0-or-later license. See NOTICE for details.

@misc{heretic,
  author = {Weidmann, Philipp Emanuel},
  title = {Heretic: Fully automatic censorship removal for language models},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/p-e-w/heretic}}
}

Contributing

Contributions of all kinds are welcome: new model configs, benchmark results, bug reports, documentation, new steering methods. See CONTRIBUTING.md for development setup, the PR process, and guidance on adding model configs.

The single most impactful contribution is a tested TOML config for a model we don't yet support. Every new config unlocks a new architecture for everyone.

All contributions are released under the AGPL-3.0 license.

License

Abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann, licensed under the GNU Affero General Public License v3.0 or later.

Original work Copyright (C) 2025 Philipp Emanuel Weidmann
Modified work Copyright (C) 2026 Wangzhang Wu
