# Gemma 4 E4B SABER
This is a SABER-edited derivative of google/gemma-4-E4B-it.
SABER is my take on abliteration: an entanglement-aware refusal-ablation method that reduces refusal behavior while limiting behavioral drift from the base model. I did not invent refusal ablation; this work builds on prior refusal-direction and community abliteration methods, including Arditi et al., Maxime Labonne / FailSpy-style abliteration recipes, Jim Lai's projected and norm-preserving ablation work, Pliny / OBLITERATUS, Heretic, Jiunsong's SuperGemma releases, and spectral-cleaning / surgical-refusal-ablation style work.
## Method Summary
SABER treats refusal ablation as a constrained multi-objective editing problem.
Instead of using a single refusal direction with one uniform ablation strength, SABER combines:
- multi-direction refusal subspaces extracted from candidate layers,
- FDR / separability-based ranking of directions and layers,
- capability-entanglement scoring,
- differential ablation strengths for refusal-dominant vs. capability-entangled directions,
- Pareto evaluation over refusal rate and behavioral drift.
The goal is not simply to remove refusals at all costs. The goal is to suppress refusal behavior while measuring and limiting drift from the original model.
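The differential-strength projection step can be sketched as follows. This is a minimal illustration, not the SABER implementation: the function name, shapes, and the hard entanglement threshold are assumptions; the default values mirror the hyperparameters of the released checkpoint.

```python
import numpy as np

def ablate_activations(h, directions, entanglement,
                       alpha_base=0.825, alpha_entangled=0.03, threshold=0.55):
    """Project refusal directions out of hidden states, ablating
    capability-entangled directions far more gently.

    h            : (batch, d_model) hidden states
    directions   : iterable of unit-norm refusal directions, each (d_model,)
    entanglement : per-direction capability-entanglement scores in [0, 1];
                   above `threshold` a direction is treated as entangled
    """
    h = np.asarray(h, dtype=np.float64).copy()
    for v, e in zip(directions, entanglement):
        v = np.asarray(v, dtype=np.float64)
        alpha = alpha_entangled if e > threshold else alpha_base
        # subtract alpha times the projection of h onto v
        h -= alpha * np.outer(h @ v, v)
    return h
```

With `alpha_base = 1.0` this reduces to full directional ablation; the released checkpoint instead uses a partial strength of 0.825 for refusal-dominant directions and a near-zero 0.03 for entangled ones.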
## Released Checkpoint
This checkpoint corresponds to the current aggressive Gemma 4 E4B SABER point:
- Run: gemma4_e4b_auto_svd_nh60_a825_g14
- Base: google/gemma-4-E4B-it
- Extraction: SVD
- Directions per layer: 4
- Layer strategy: top-k
- Global top-k: 14
- Alpha base: 0.825
- Alpha entangled: 0.03
- Entanglement threshold: 0.55
- Max iterations: 4
- Probe set: 49 harmful / 49 harmless / 30 capability prompts
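The hyperparameters above can be collected into a single config object. The key names below are illustrative, not the actual SABER config schema:

```python
# Hypothetical config mirroring the released checkpoint's hyperparameters;
# key names are illustrative and do not reflect SABER's real schema.
SABER_CONFIG = {
    "run": "gemma4_e4b_auto_svd_nh60_a825_g14",
    "base_model": "google/gemma-4-E4B-it",
    "extraction": "svd",
    "directions_per_layer": 4,
    "layer_strategy": "top_k",
    "global_top_k": 14,             # (layer, direction) pairs kept overall
    "alpha_base": 0.825,            # strength for refusal-dominant directions
    "alpha_entangled": 0.03,        # gentle strength for entangled directions
    "entanglement_threshold": 0.55,
    "max_iterations": 4,
    "probes": {"harmful": 49, "harmless": 49, "capability": 30},
}
```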
## Evaluation Snapshot
Measured against the local SABER evaluation harness:
| metric | value |
|---|---|
| keyword refusal rate | 0.00% |
| mean KLD vs base | 0.4327 |
| residual refusal proxy | 3.9394 |
| score used in local sweep | 0.0432 |
This is the most aggressive Pareto point in the current sweep: it eliminates keyword refusals on the local test set, at the cost of higher behavioral drift than the balanced SABER point.
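For readers unfamiliar with the two headline metrics, here is a minimal sketch of how they are typically computed. The keyword list is illustrative (the harness's actual list is not published here), and mean KLD vs base is the average of per-position KL divergences between the base and edited models' next-token distributions:

```python
import math

# Illustrative refusal keywords; an assumption, not the SABER harness list.
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def keyword_refusal_rate(completions):
    """Fraction of completions containing any refusal keyword (case-insensitive)."""
    hits = sum(any(k in c.lower() for k in REFUSAL_KEYWORDS) for c in completions)
    return hits / len(completions)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Averaging `kl_divergence(base_dist, edited_dist)` over sampled token positions yields a drift proxy comparable to the "mean KLD vs base" figure above.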
Balanced comparison point from the same sweep:
| run | keyword refusal | mean KLD |
|---|---|---|
| gemma4_e4b_auto_svd_nh60_a825_g14 | 0.00% | 0.4327 |
| gemma4_e4b_auto_svd_a825_g14 | 2.04% | 0.3164 |
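One way to read this table: neither run dominates the other, so both sit on the refusal/drift Pareto front. A minimal non-dominated filter, assuming lower is better on both axes (function name and tuple layout are my own):

```python
def pareto_front(points):
    """Keep non-dominated (name, refusal_rate, kld) points; lower is better on both axes."""
    front = []
    for name, r, k in points:
        dominated = any(
            r2 <= r and k2 <= k and (r2 < r or k2 < k)
            for _, r2, k2 in points
        )
        if not dominated:
            front.append((name, r, k))
    return front
```

Feeding in the two sweep points above returns both: the aggressive run wins on refusal rate, the balanced run wins on KLD.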
## Key Finding
Increasing the refusal/harmless probe count changed the FDR-selected layer pattern. The larger probe set eliminated refusal in the local eval, but increased KLD. This suggests probe count is a real hyperparameter for refusal/KLD tradeoffs, not just an implementation detail.
## Intended Use
This checkpoint is intended for research into representation editing, refusal mechanisms, and behavior-preserving model editing. It is not a safety system and not a guarantee of harmless behavior.
## Limitations
- Evaluation is local and limited; broader benchmark and qualitative testing are still needed.
- Keyword refusal rate is not the same as a full safety evaluation.
- Lower refusal can increase dual-use risk.
- KLD is a proxy for behavioral drift, not a complete measure of capability preservation.
- This is an aggressive point on the frontier; users who prefer lower drift may prefer the balanced SABER point instead.
## Provenance and Credit
This work belongs in the abliteration lineage. It was inspired by earlier methods and community experimentation showing that refusal behavior can be edited in activation space. In particular, Jiunsong's SuperGemma work was an important inspiration because it showed that abliteration in the Gemma family could improve practical behavior rather than merely removing refusals.
SABER's contribution is the specific combination of separability-ranked multi-direction extraction, capability-entanglement-aware scaling, and Pareto-style refusal/drift evaluation.