Upload README.md with huggingface_hub
README.md
CHANGED
@@ -38,11 +38,11 @@ I'm a PhD student in visual neuroscience at the University of Toronto who also h

 SABER is a multi-stage refusal ablation pipeline that goes beyond simple direction removal. Where prior methods (Arditi et al. 2024, Gabliteration) find and remove a single "refusal direction," SABER introduces three key innovations:

-1. **Entanglement-aware ablation** — SABER
+1. **Entanglement-aware ablation** — SABER distinguishes between "pure refusal" directions and directions that are entangled with useful capabilities. Pure refusal gets fully removed; entangled components are handled carefully to preserve model quality where blunt methods degrade it.

-2. **
+2. **Principled layer selection** — Rather than targeting layers heuristically, SABER uses statistical analysis to automatically identify the layers where refusal behavior is most concentrated and most cleanly separable from normal operation.

-3. **
+3. **Iterative refinement** — After each ablation pass, SABER re-probes the model to catch dormant refusal circuits that activate to compensate for removed ones. Multiple passes with decaying strength ensure thorough removal without overcorrection.

 ## SABER Results

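The "multiple passes with decaying strength" idea in the added text can be illustrated with a toy sketch. This is not SABER's implementation, which the README does not spell out. It assumes a difference-of-means direction between two sets of activations, and the names `refusal_direction`, `ablate`, and `iterative_ablation`, the `strength` and `decay` parameters, and the sample vectors are all hypothetical. Entanglement handling and layer selection are not modeled here.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def separation(harmful, harmless):
    """Distance between the two class means (a crude 'refusal signal')."""
    return norm([a - b for a, b in zip(mean(harmful), mean(harmless))])

def refusal_direction(harmful, harmless):
    """Unit difference-of-means direction, or None if the means coincide."""
    diff = [a - b for a, b in zip(mean(harmful), mean(harmless))]
    length = norm(diff)
    if length < 1e-12:
        return None
    return [x / length for x in diff]

def ablate(vectors, direction, strength):
    """Subtract `strength` times each vector's projection onto `direction`."""
    out = []
    for v in vectors:
        coeff = strength * sum(x * d for x, d in zip(v, direction))
        out.append([x - coeff * d for x, d in zip(v, direction)])
    return out

def iterative_ablation(harmful, harmless, strength=0.8, decay=0.5, passes=3):
    """Re-estimate the direction after every pass and ablate with decaying strength."""
    history = [separation(harmful, harmless)]
    for _ in range(passes):
        direction = refusal_direction(harmful, harmless)
        if direction is None:  # nothing left to remove
            break
        harmful = ablate(harmful, direction, strength)
        harmless = ablate(harmless, direction, strength)
        history.append(separation(harmful, harmless))
        strength *= decay
    return harmful, harmless, history

# Hypothetical toy activations; a real pipeline would read these from a model.
harmful = [[2.0, 1.0, 0.0], [3.0, 1.0, 1.0]]
harmless = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
_, _, history = iterative_ablation(harmful, harmless)
print(history)  # separation between class means, one entry per pass
```

In this toy setting the separation shrinks on every pass because each pass removes a fraction of the remaining mean difference. In a real model, re-probing could also surface a different direction than the one found initially, which is the case the iterative step in the README is meant to handle.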