Gemma-4-12B-it AEON Abliterated — K=4 Biprojection (NVFP4, MLP-only)

A quality-preserving 4-bit NVFP4 quantization of our K=4 biprojection abliteration of google/gemma-4-12B-it. Attention stays BF16 and the MLP is NVFP4: a deliberate split that keeps reasoning largely intact (about 3-7pp below BF16 across MMLU and HumanEval, see the eval table) while delivering the throughput and memory win that 4-bit gives on bandwidth-bound hardware like the DGX Spark. ~11.7 GB, 254 tok/s aggregate at 16 concurrent streams. Loads in vLLM with --quantization modelopt.

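Want zero measurable loss? Use the FP8 sibling: indistinguishable from BF16 on these evals, 13 GB. Want a small, fast 4-bit build with a reasonable trade-off? You're in the right place (the Mixed NVFP4+FP8 variant listed under Available formats is smaller and faster still).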

Refusal behavior has been removed; the model responds to prompts the base would decline. Operator-side safety is your responsibility — see the arbitration clause at the bottom.


🚀 QuickStart

Docker (recommended, DGX Spark / Blackwell)

# 1. Download
huggingface-cli download AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-NVFP4 \
  --local-dir ./Gemma-4-12B-AEON-K4-NVFP4

# 2. Serve
docker run -d --name aeon-gemma12b --gpus all --ipc=host --shm-size=16g --net=host \
  -v $(pwd)/Gemma-4-12B-AEON-K4-NVFP4:/model:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
    --served-model-name gemma12b \
    --quantization modelopt \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 8192 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.85 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code

# 3. Call (OpenAI-compatible)
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma12b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}' \
  | python3 -c "import json,sys; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
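
The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the server from step 2 is listening on localhost:8000 (vLLM's default port) and was started with --served-model-name gemma12b; the helper names build_payload, extract_text, and chat are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumed values that mirror the Docker command above.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "gemma12b"


def build_payload(prompt, max_tokens=128):
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_text(response):
    """Pull the assistant text out of a chat-completions response dict."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, max_tokens=128, timeout=60):
    """POST the prompt to the local server and return the reply text."""
    body = json.dumps(build_payload(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_text(json.load(reply))
```

With the container running, print(chat("Hello!")) should print the model's reply. Splitting payload construction from the network call keeps the request shape easy to inspect without a live server.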

Plain vLLM

pip install "vllm>=0.22.2" "nvidia-modelopt>=0.43" "transformers>=5.10"
vllm serve AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-NVFP4 \
  --quantization modelopt --kv-cache-dtype fp8_e4m3 \
  --max-model-len 8192 --max-num-seqs 16 \
  --gpu-memory-utilization 0.85 --trust-remote-code

⚠️ Needs vLLM ≥ 0.22.2 (for the Gemma4UnifiedForConditionalGeneration loader) and a Blackwell GPU (DGX Spark GB10 sm_121a, B100/B200, RTX 50-series). On Hopper/Ampere the FP4 weights dequantize with no speed benefit — use the BF16 sibling or FP8 there.

That's it. Everything below is detail.


Capability — full-length eval (measured)

All four axes were measured through the vLLM serving path with identical prompts and settings. MMLU is balanced across all 57 subjects (5 questions each = 285 Q), not a single-subject slice. HumanEval is the full 164 problems.

| Model | MMLU (285) | HumanEval-syn (164) | HumanEval-fun (164) | IFEval (50) |
|---|---|---|---|---|
| google/gemma-4-12B-it (official base) | 81.4% | 99.4% | 82.9% | 90% |
| K4-BF16 (abliterated, full precision) | 80.4% | 99.4% | 83.5% | 90% |
| K4-NVFP4 MLP-only (this) | 76.8% | 96.3% | 76.2% | 90% |
| K4-FP8 (sibling, near-lossless) | 80.4% | 99.4% | 85.4% | 90% |
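
A balanced MMLU slice like the one described above (5 questions from each of 57 subjects) can be drawn as follows. This is only a sketch of the sampling idea: the function name, the fixed seed, and the placeholder question pool are assumptions, and the actual eval harness may select questions differently.

```python
import random


def balanced_sample(questions_by_subject, per_subject=5, seed=0):
    """Draw the same number of questions from every subject."""
    rng = random.Random(seed)  # fixed seed so the slice is reproducible
    sample = []
    for subject in sorted(questions_by_subject):
        pool = questions_by_subject[subject]
        sample.extend((subject, q) for q in rng.sample(pool, per_subject))
    return sample


# Placeholder data: 57 hypothetical subjects with 20 dummy question ids each.
toy_pool = {f"subject_{i:02d}": list(range(20)) for i in range(57)}
slice_285 = balanced_sample(toy_pool)
print(len(slice_285))  # 57 subjects * 5 questions = 285
```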

The 4-bit trade-off is small and honest: −3.6pp MMLU, −3.1pp HumanEval-syntactic, −7.3pp HumanEval-functional vs the BF16 model; instruction-following (IFEval) is unchanged. This is a minimal-loss result for a 4-bit quant — achieved by keeping all attention layers at BF16 (where precise multi-step reasoning lives) and quantizing only the MLP (bandwidth-bound, FP4-friendly). For comparison, a naive full-W4A4 NVFP4 of this model loses ~21pp on hard reasoning; this MLP-only split avoids that.
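
The quoted deltas follow directly from the table. A short check, using only the scores listed above:

```python
# Scores copied from the capability table (percent).
bf16 = {"MMLU": 80.4, "HumanEval-syn": 99.4, "HumanEval-fun": 83.5, "IFEval": 90.0}
nvfp4 = {"MMLU": 76.8, "HumanEval-syn": 96.3, "HumanEval-fun": 76.2, "IFEval": 90.0}

# Difference in percentage points, NVFP4 minus BF16, per axis.
deltas = {axis: round(nvfp4[axis] - bf16[axis], 1) for axis in bf16}
for axis, delta in deltas.items():
    print(f"{axis}: {delta:+.1f}pp")
```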

Throughput (DGX Spark GB10, FP8 KV cache, greedy)

| Metric | NVFP4 MLP-only (this) | BF16 | FP8 |
|---|---|---|---|
| Concurrent ×16 aggregate | 254 tok/s | 144 tok/s | 235 tok/s |
| Single-stream overall | 15.7 tok/s | 7.7 tok/s | 15.8 tok/s |
| Size | 11.7 GB | 24 GB | 13 GB |

1.76× BF16 throughput at under half the memory. That is the win that matters on the DGX Spark's unified-memory GB10, where weight bandwidth dominates decode. The KV cache holds ~1.23M tokens, enough for about 150 concurrent requests at the full 8k context.
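
The headline ratios can be reproduced from the figures above. The 8192-token context length is taken from the --max-model-len setting in the QuickStart; everything else is copied from the table and the paragraph.

```python
# Figures copied from the throughput table and the paragraph above.
nvfp4_tok_s, bf16_tok_s = 254, 144
nvfp4_gb, bf16_gb = 11.7, 24
kv_tokens, context_len = 1_230_000, 8192

speedup = nvfp4_tok_s / bf16_tok_s        # aggregate throughput ratio
memory_ratio = nvfp4_gb / bf16_gb         # fraction of the BF16 footprint
full_contexts = kv_tokens / context_len   # 8k-token sequences that fit in the KV cache

print(f"speedup: {speedup:.2f}x")
print(f"memory: {memory_ratio:.0%} of BF16")
print(f"full-length sequences in KV cache: {full_contexts:.0f}")
```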

Quantization methodology

| Property | Value |
|---|---|
| Tool | NVIDIA ModelOpt 0.43.0 |
| Config | NVFP4_MLP_ONLY_CFG |
| What's quantized | MLP gate_proj / up_proj / down_proj on all 48 layers → NVFP4 (E2M1, block_size=16, E4M3 block scales) |
| What stays BF16 | all self_attn (q/k/v/o), lm_head, embed_tokens, embed_vision* / embed_audio* / vision_embedder* |
| Why MLP-only | Gemma-4 attention carries the precise-reasoning signal + per-channel outliers; keeping it BF16 preserves MMLU/reasoning. NVIDIA ships their Gemma-4-31B-IT-NVFP4 the same way. |
| Calibration | 1024 × CNN/DailyMail validation @ 1024 tokens, native sm_121a |
| Size | ~11.7 GB (from 23.9 GB BF16, a 51% reduction) |
| Runtime | vLLM --quantization modelopt via Gemma4UnifiedForConditionalGeneration |
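
To make the number format concrete, here is a pure-Python sketch of block-wise fake quantization onto the E2M1 grid with one shared scale per 16 weights. It illustrates the format only and is not ModelOpt's implementation: the block scale is kept as a Python float rather than rounded to E4M3, calibration is ignored, and the function names are illustrative.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def fake_quantize_block(block):
    """Round one block to the E2M1 grid using a shared scale, then dequantize."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_GRID[-1]  # real NVFP4 stores this scale in E4M3
    out = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((nearest if x >= 0 else -nearest) * scale)
    return out


def fake_quantize(weights):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(fake_quantize_block(weights[start:start + BLOCK_SIZE]))
    return out


# 4 bits per weight plus one 8-bit scale per 16 weights.
bits_per_weight = 4 + 8 / BLOCK_SIZE
print(bits_per_weight)  # 4.5
```

The per-block scale is why a single outlier only degrades the 15 weights that share its block, instead of the whole tensor.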

vLLM loader note (reproducers)

Google's Gemma-4-12B is the encoder-free Gemma4UnifiedForConditionalGeneration. ModelOpt's HF export needs two touch-ups to load in vLLM: rename the vision keys to vLLM's vision_embedder.* layout, and add model.vision_embedder* to the quant ignore list. Scripted in make_vllm_ready.py. Requires vLLM ≥ 0.22.2.
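
The two touch-ups amount to a key rename plus one extra ignore pattern. The sketch below shows the shape of that transformation on plain Python data; it is not the contents of make_vllm_ready.py. The old prefix model.embed_vision. and the "ignore" config key are hypothetical placeholders, so check the exported checkpoint and quant config for the real names.

```python
# Hypothetical names: the real ones come from the exported checkpoint.
OLD_PREFIX = "model.embed_vision."
NEW_PREFIX = "model.vision_embedder."
IGNORE_PATTERN = "model.vision_embedder*"


def rename_keys(tensor_names):
    """Map each tensor name to its new name, rewriting only the vision prefix."""
    return {
        name: (NEW_PREFIX + name[len(OLD_PREFIX):]) if name.startswith(OLD_PREFIX) else name
        for name in tensor_names
    }


def add_ignore(quant_config, pattern=IGNORE_PATTERN):
    """Return a copy of the quant config with the pattern in its ignore list."""
    updated = dict(quant_config)
    ignore = list(updated.get("ignore", []))  # "ignore" is an assumed key name
    if pattern not in ignore:
        ignore.append(pattern)
    updated["ignore"] = ignore
    return updated
```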

Abliteration methodology

K=4 multi-direction norm-preserving biprojection (extends TrevorJS). Basis layers L24/L37/L39/L26, o_proj + mlp.down_proj edited on 24/48 layers, scale=1.0. The eval above shows the abliteration is capability-neutral — K4-BF16 is within ~1pp of Google's official base on every axis. Full math on the BF16 card.
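
As a rough intuition for "multi-direction" and "norm-preserving", the sketch below removes the span of K directions from a single weight vector and then rescales it back to its original length. It is a simplified illustration under those two assumptions only, not the biprojection math from the BF16 card; the function names are illustrative.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions):
    """Gram-Schmidt: turn K directions into an orthonormal basis."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            coeff = dot(v, b)
            v = [x - coeff * y for x, y in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip directions already spanned by the basis
            basis.append([x / norm for x in v])
    return basis


def ablate_vector(w, basis, scale=1.0):
    """Remove the basis components from w, then restore w's original norm."""
    original_norm = math.sqrt(dot(w, w))
    out = list(w)
    for b in basis:
        coeff = dot(w, b)
        out = [x - scale * coeff * y for x, y in zip(out, b)]
    new_norm = math.sqrt(dot(out, out))
    if new_norm < 1e-12:
        return out  # w lay entirely inside the removed subspace
    return [x * original_norm / new_norm for x in out]
```

Applied to every column of a matrix that writes into the residual stream, this leaves each column orthogonal to the chosen directions while keeping its magnitude, which is the property the term "norm-preserving" points at.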

Available formats

| Variant | Repo | Precision | Size | Pick when |
|---|---|---|---|---|
| FP8 | …-K4-FP8 | FP8 E4M3 | 13 GB | Quality matters: near-lossless, matches BF16 |
| Mixed NVFP4+FP8 | …-K4-NVFP4-FP8 | NVFP4 MLP + FP8 attn | 9.3 GB | Smallest + fastest: MLP-only quality, 20% less size, 34% faster |
| NVFP4 MLP-only (this) | …-K4-NVFP4 | NVFP4 MLP + BF16 attn | 11.7 GB | Superseded by Mixed NVFP4+FP8 (above) |
| BF16 | …-K4-BF16 | bfloat16 | 24 GB | Fine-tuning, non-Blackwell hardware |

Acknowledgements

TrevorJS (biprojection), p-e-w/heretic (framework), NVIDIA ModelOpt (NVFP4 toolkit + Gemma-4-31B MLP-only reference recipe), AEON-7 (K-direction extension, MLP-only NVFP4 recipe + vLLM loader fixes + capability eval).

License

Inherits the Gemma license.


Arbitration Clause

By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:

  1. Sole Responsibility. You, the user, are solely and exclusively responsible for (a) every prompt you or your downstream system issue to this model, (b) every response this model produces in reply, (c) every downstream action taken by you, your systems, your agents, or your users in reliance on those responses, and (d) any harm — direct, indirect, consequential, foreseeable, or otherwise — that results from any of the above.

  2. No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.

  3. Legal Compliance. You are responsible for ensuring that your use of this model complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.

  4. Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.

  5. Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater, not lesser, caution, forethought, and ethical discipline when operating this model than you would when operating an aligned base model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.

  6. No Endorsement of Outputs. The authors, contributors, and publishers of this model do not endorse, adopt, or take responsibility for any specific output this model produces. Outputs are a stochastic function of the prompt, the weights, and the sampler state — not a statement of position by any human.

  7. Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.

  8. Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.

  9. Severability. If any provision of this clause is held unenforceable in a given jurisdiction, the remaining provisions remain in full force in that jurisdiction, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.

  10. Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.

This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.


☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC)
bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
Ξ Ethereum (ETH)
0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
◎ Solana (SOL)
DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
ⓜ Monero (XMR)
836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
