DJLougen committed on
Commit
aa7107a
·
verified ·
1 Parent(s): 183f39b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -38,11 +38,11 @@ I'm a PhD student in visual neuroscience at the University of Toronto who also h
 
  SABER is a multi-stage refusal ablation pipeline that goes beyond simple direction removal. Where prior methods (Arditi et al. 2024, Gabliteration) find and remove a single "refusal direction," SABER introduces three key innovations:
 
- 1. **Entanglement-aware ablation** — SABER quantifies how much each refusal direction overlaps with capability-critical representations. Directions that are "pure refusal" get fully removed; directions entangled with useful capabilities receive proportionally reduced ablation. This is why SABER preserves model quality where blunt methods degrade it.
+ 1. **Entanglement-aware ablation** — SABER distinguishes between "pure refusal" directions and directions that are entangled with useful capabilities. Pure refusal gets fully removed; entangled components are handled carefully to preserve model quality where blunt methods degrade it.
 
- 2. **Fisher discriminant layer selection** — Instead of guessing which layers to target, SABER uses Fisher Discriminant Ratios to identify layers where refusal representations are most cleanly separable from normal behavior. This focuses the surgery where it matters most.
+ 2. **Principled layer selection** — Rather than targeting layers heuristically, SABER uses statistical analysis to automatically identify the layers where refusal behavior is most concentrated and most cleanly separable from normal operation.
 
- 3. **Hydra-aware iterative refinement** — After each ablation pass, SABER re-probes the model to catch "hydra" features — dormant refusal circuits that activate to compensate for removed ones. Iterative passes with decaying strength ensure thorough removal without overcorrection.
+ 3. **Iterative refinement** — After each ablation pass, SABER re-probes the model to catch dormant refusal circuits that activate to compensate for removed ones. Multiple passes with decaying strength ensure thorough removal without overcorrection.
 
  ## SABER Results
 
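
Reviewer note: the removed lines name concrete mechanisms (Fisher Discriminant Ratios, ablation reduced in proportion to entanglement, passes with decaying strength) that the new wording generalizes. The NumPy sketch below illustrates how those ideas could fit together. It is not SABER's implementation: every function name, the difference-of-means direction, the orthonormal `capability_basis`, and the `decay` schedule are assumptions made for illustration, and the re-probing step is omitted, so the direction stays fixed across passes.

```python
import numpy as np


def refusal_direction(refusal_acts, normal_acts):
    """Unit difference-of-means direction between two activation sets (assumed probe)."""
    diff = refusal_acts.mean(axis=0) - normal_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def fisher_ratio(refusal_acts, normal_acts, direction):
    """Between-class over within-class variance of projections onto `direction`."""
    p_ref = refusal_acts @ direction
    p_norm = normal_acts @ direction
    between = (p_ref.mean() - p_norm.mean()) ** 2
    within = p_ref.var() + p_norm.var()
    return float(between / (within + 1e-12))


def entanglement(direction, capability_basis):
    """Share of a unit direction lying in the span of orthonormal capability rows."""
    coeffs = capability_basis @ direction
    return float(np.clip(coeffs @ coeffs, 0.0, 1.0))


def ablate(weight, direction, strength):
    """Subtract `strength` times the component of the layer output along `direction`.

    `weight` has shape (d_out, d_in); `direction` is a unit vector of length d_out.
    """
    return weight - strength * np.outer(direction, direction @ weight)


def iterative_ablate(weight, direction, capability_basis, passes=3, decay=0.5):
    """Apply several passes whose strength decays and shrinks with entanglement."""
    base = 1.0 - entanglement(direction, capability_basis)
    for i in range(passes):
        weight = ablate(weight, direction, base * decay ** i)
    return weight
```

Under these assumptions, layer selection would amount to computing `fisher_ratio` per layer and keeping the highest-scoring ones. A direction fully inside the capability subspace gets a base strength of 0 and is left untouched, while one orthogonal to it is removed completely on the first pass.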