How to use with vLLM

Install vLLM from pip, serve the model, and call the OpenAI-compatible API:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "GestaltLabs/Gemma-4-E4B-SABER"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GestaltLabs/Gemma-4-E4B-SABER",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```
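The same request can be built from Python using only the standard library. This is a minimal sketch, not an official client: the `build_payload` and `chat_completion` helpers are hypothetical names, and the server URL assumes the local vLLM default from above.

```python
import json
import urllib.request

def build_payload(model: str, text: str, image_url: str) -> dict:
    # Same OpenAI-compatible chat payload shape as the curl example above.
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

def chat_completion(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    # Hypothetical helper: POST the payload to a locally running vLLM server.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_payload(
    "GestaltLabs/Gemma-4-E4B-SABER",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
```

Calling `chat_completion(payload)` with the server running returns the usual chat-completions response dict.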
Use Docker

```shell
docker model run hf.co/GestaltLabs/Gemma-4-E4B-SABER
```

Gemma 4 E4B SABER

This is a SABER-edited derivative of google/gemma-4-E4B-it.

SABER is my take on abliteration: an entanglement-aware refusal-ablation method that tries to reduce refusal behavior while limiting behavioral drift from the base model. I did not invent refusal ablation. This work is inspired by prior refusal-direction and community abliteration methods, including Arditi et al., Maxime Labonne / FailSpy-style abliteration recipes, Jim Lai's projected and norm-preserving ablation work, Pliny / OBLITERATUS, Heretic, Jiunsong's SuperGemma releases, and spectral-cleaning / surgical-refusal-ablation-style work.

Method Summary

SABER treats refusal ablation as a constrained multi-objective editing problem.

Instead of using a single refusal direction with one uniform ablation strength, SABER combines:

  • multi-direction refusal subspaces extracted from candidate layers,
  • FDR / separability-based ranking of directions and layers,
  • capability-entanglement scoring,
  • differential ablation strengths for refusal-dominant vs. capability-entangled directions,
  • Pareto evaluation over refusal rate and behavioral drift.

The goal is not simply to remove refusals at all costs. The goal is to suppress refusal behavior while measuring and limiting drift from the original model.
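To make the "differential ablation strengths" idea concrete, here is a toy sketch (not the SABER implementation) of directional ablation with two strengths: each refusal direction's component is removed from a hidden state at `alpha_base`, except directions flagged as capability-entangled, which are barely touched at `alpha_entangled`. The alpha values mirror the released checkpoint below; everything else is illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def ablate(h, directions, entangled, alpha_base=0.825, alpha_entangled=0.03):
    """Remove refusal-direction components from hidden state h.

    Refusal-dominant directions are ablated at alpha_base; directions
    flagged as capability-entangled get the much weaker alpha_entangled.
    Toy sketch only: real activations live in model hidden dimensions.
    """
    out = list(h)
    for d, is_entangled in zip(directions, entangled):
        d = normalize(d)
        alpha = alpha_entangled if is_entangled else alpha_base
        c = dot(out, d)
        out = [x - alpha * c * di for x, di in zip(out, d)]
    return out

h = [1.0, 2.0, 3.0]
d_refusal = [1.0, 0.0, 0.0]    # refusal-dominant: ablated hard
d_entangled = [0.0, 1.0, 0.0]  # capability-entangled: barely touched
h2 = ablate(h, [d_refusal, d_entangled], [False, True])
```

With `alpha = 1.0` this would be full projection removal; the fractional alphas are what make the edit partial and direction-specific.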

Released Checkpoint

This checkpoint corresponds to the most aggressive current Gemma 4 E4B SABER point:

  • Run: gemma4_e4b_auto_svd_nh60_a825_g14
  • Base: google/gemma-4-E4B-it
  • Extraction: SVD
  • Directions per layer: 4
  • Layer strategy: top-k
  • Global top-k: 14
  • Alpha base: 0.825
  • Alpha entangled: 0.03
  • Entanglement threshold: 0.55
  • Max iterations: 4
  • Probe set: 49 harmful / 49 harmless / 30 capability prompts
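The hyperparameters above, collected into a plain Python dict for reference. The key names are illustrative, not the actual SABER config schema; the values are taken directly from this card.

```python
# Released-checkpoint hyperparameters from this card (key names illustrative).
saber_config = {
    "run": "gemma4_e4b_auto_svd_nh60_a825_g14",
    "base_model": "google/gemma-4-E4B-it",
    "extraction": "svd",
    "directions_per_layer": 4,
    "layer_strategy": "top-k",
    "global_top_k": 14,
    "alpha_base": 0.825,
    "alpha_entangled": 0.03,
    "entanglement_threshold": 0.55,
    "max_iterations": 4,
    "probes": {"harmful": 49, "harmless": 49, "capability": 30},
}
```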

Evaluation Snapshot

Measured against the local SABER evaluation harness:

| metric | value |
| --- | --- |
| keyword refusal rate | 0.00% |
| mean KLD vs base | 0.4327 |
| residual refusal proxy | 3.9394 |
| score used in local sweep | 0.0432 |

This point is the most aggressive current Pareto point: it eliminates keyword refusals in the local test set, but it has higher behavioral drift than the balanced SABER point.
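For intuition, the two headline metrics can be sketched as follows. These definitions are my reading of the card, not the actual SABER evaluation harness: the refusal-marker list is illustrative, and mean KLD is taken as the average KL divergence between paired next-token distributions of the edited and base models.

```python
import math

REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "as an ai"]  # illustrative

def keyword_refusal_rate(responses):
    """Fraction of responses containing a refusal marker (case-insensitive)."""
    hits = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return hits / len(responses)

def mean_kld(p_dists, q_dists, eps=1e-12):
    """Mean KL(p || q) over paired next-token distributions (edited vs. base)."""
    total = 0.0
    for p, q in zip(p_dists, q_dists):
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return total / len(p_dists)
```

Keyword matching is cheap but crude, which is why the card also tracks a residual refusal proxy alongside it.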

Balanced comparison point from the same sweep:

| run | keyword refusal | mean KLD |
| --- | --- | --- |
| gemma4_e4b_auto_svd_nh60_a825_g14 | 0.00% | 0.4327 |
| gemma4_e4b_auto_svd_a825_g14 | 2.04% | 0.3164 |
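Both runs above are Pareto-optimal: neither is better on both refusal and KLD at once. A minimal sketch of that dominance check, assuming lower is better on both axes (the third run is a fabricated example of a dominated point, not a real sweep result):

```python
def pareto_front(points):
    """Keep runs not dominated on (refusal_rate, kld); lower is better on both."""
    front = []
    for name, r, k in points:
        dominated = any(
            (r2 <= r and k2 <= k) and (r2 < r or k2 < k)
            for _, r2, k2 in points
        )
        if not dominated:
            front.append(name)
    return front

runs = [
    ("gemma4_e4b_auto_svd_nh60_a825_g14", 0.0000, 0.4327),  # aggressive point
    ("gemma4_e4b_auto_svd_a825_g14", 0.0204, 0.3164),       # balanced point
    ("hypothetical_dominated_run", 0.0300, 0.5000),         # worse on both axes
]
front = pareto_front(runs)
```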

Key Finding

Increasing the refusal/harmless probe count changed the FDR-selected layer pattern. The larger probe set eliminated refusal in the local eval, but increased KLD. This suggests probe count is a real hyperparameter for refusal/KLD tradeoffs, not just an implementation detail.
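One way to read the "FDR / separability-based ranking" that the probe count feeds into: score each candidate direction by the Fisher discriminant ratio of harmful vs. harmless probe activations projected onto it, then keep the best-separated directions and layers. This is my interpretation as a sketch, not the SABER code.

```python
def fdr(harmful_proj, harmless_proj):
    """Fisher discriminant ratio: (mu1 - mu2)^2 / (var1 + var2).

    Inputs are scalar projections of harmful / harmless probe activations
    onto one candidate refusal direction; higher means better separated,
    so direction ranking favors high-FDR directions.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    m1, m2 = mean(harmful_proj), mean(harmless_proj)
    return (m1 - m2) ** 2 / (var(harmful_proj) + var(harmless_proj))
```

Under this reading, a larger probe set changes the per-direction means and variances, which can reshuffle which layers clear the selection threshold, consistent with the finding above.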

Intended Use

This checkpoint is intended for research into representation editing, refusal mechanisms, and behavior-preserving model editing. It is not a safety system and not a guarantee of harmless behavior.

Limitations

  • Evaluation is local and limited; broader benchmark and qualitative testing are still needed.
  • Keyword refusal rate is not the same as a full safety evaluation.
  • Lower refusal can increase dual-use risk.
  • KLD is a proxy for behavioral drift, not a complete measure of capability preservation.
  • This is an aggressive point on the frontier; users who prefer lower drift may prefer the balanced SABER point instead.

Provenance and Credit

This work belongs in the abliteration lineage. It was inspired by earlier methods and community experimentation showing that refusal behavior can be edited in activation space. In particular, Jiunsong's SuperGemma work was an important inspiration because it showed that abliteration in the Gemma family could improve practical behavior rather than merely removing refusals.

SABER's contribution is the specific combination of separability-ranked multi-direction extraction, capability-entanglement-aware scaling, and Pareto-style refusal/drift evaluation.

Model size: 8B params (Safetensors, BF16)