Gemma-4-12B-it AEON Abliterated — MLX 8-bit (mxfp8, near-lossless)
The near-lossless Apple-Silicon build of AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16. Every language-decoder linear is mxfp8 (8-bit, group 32); the tied soft-capped head and the vision/audio projectors are kept at bf16. Built and validated on a MacBook Pro M4 Pro (48 GB).
Target hardware: Apple Silicon (M-series), best on 24 GB+ unified memory. Full multimodal (text + image + audio) via mlx-vlm.
Want the smallest build for a 16 GB Mac? See the compact FP4 sibling: …-MLXFP4.
This is the maximum-fidelity member of the MLX quant grid: measured top-1 token agreement 0.924 and median KL ≈ 0.002 nats against the BF16 source on the model's own greedy output — below the perceptual/sampling-noise floor on the typical token. Near-lossless, and the recommended build when quality is paramount.
⚡ Quickstart (Apple Silicon)
0 → running on a fresh Mac (no Python, no tools needed) — uv installs a correct Python + the deps for you:
curl -LsSf https://astral.sh/uv/install.sh | sh && source $HOME/.local/bin/env # one-time: install uv
# serve — uv fetches Python 3.12 + mlx-vlm on first run · MLX-8bit (FP8, near-lossless)
uv run --python 3.12 --with mlx-vlm -- \
python -m mlx_vlm.server --model AEON-7/Gemma-4-12B-it-AEON-Abliterated-MLX-8bit --port 8080 --max-kv-size 32768
Call it like an OpenAI endpoint (POST http://localhost:8080/v1/chat/completions) with the request "model" set to the launched id. (While this repo is private, run hf auth login first — or pass a local --model path.)
Sampling — set temperature: 1.0. The MLX server defaults to greedy decoding (temperature 0), which can repeat or loop on long prompts. This model is tuned for its native sampling — temperature 1.0 (top_p 0.95, top_k 64). Pass it in every request (clients that send no sampling params fall back to greedy):
curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"model":"AEON-7/Gemma-4-12B-it-AEON-Abliterated-MLX-8bit","messages":[{"role":"user","content":"Hello!"}],"temperature":1.0}'
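The same request can be issued from Python. The sketch below uses only the standard library and assumes the server from the quickstart is listening on localhost:8080. The helper names `build_chat_request` and `post_chat` are illustrative and not part of mlx-vlm. It sends `temperature` and `top_p` explicitly so the server does not fall back to greedy decoding; `top_k` is left out because it is not a standard OpenAI field and its handling by the server is not confirmed here.

```python
import json
import urllib.request

MODEL_ID = "AEON-7/Gemma-4-12B-it-AEON-Abliterated-MLX-8bit"
BASE_URL = "http://localhost:8080/v1"  # assumption: server started as in the quickstart


def build_chat_request(prompt: str, temperature: float = 1.0, top_p: float = 0.95,
                       max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat payload with explicit sampling parameters."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


def post_chat(payload: dict, base_url: str = BASE_URL, timeout: float = 300.0) -> str:
    """POST the payload and return the assistant text of the first choice."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(post_chat(build_chat_request("Hello!")))
```

Keeping payload construction separate from the network call makes it easy to check that every request carries the sampling parameters before anything is sent.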
Full multimodal is on by default (no flag) — send OpenAI image_url or input_audio content, or use mlx_vlm.generate --image pic.jpg / --audio clip.wav. Verified: describes images and transcribes speech.
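As a rough illustration of the OpenAI-style content parts mentioned above, the sketch below builds a chat payload that mixes text with an inline image or audio clip. The `image_url` and `input_audio` shapes follow the OpenAI chat format; that the server accepts base64 data URLs for images is an assumption, and the helper names are hypothetical.

```python
import base64
from typing import List

MODEL_ID = "AEON-7/Gemma-4-12B-it-AEON-Abliterated-MLX-8bit"


def image_part(image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url part (data URL)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}


def audio_part(audio_bytes: bytes, fmt: str = "wav") -> dict:
    """Wrap raw audio bytes as an OpenAI-style input_audio part."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": b64, "format": fmt}}


def multimodal_request(text: str, parts: List[dict], temperature: float = 1.0) -> dict:
    """Combine a text prompt with media parts into one user message."""
    content = [{"type": "text", "text": text}, *parts]
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,  # explicit, so the server does not default to greedy
    }


# Usage (file name is a placeholder):
# with open("pic.jpg", "rb") as f:
#     payload = multimodal_request("Describe this image.", [image_part(f.read())])
```

The resulting dictionary can be serialized with `json.dumps` and posted to `/v1/chat/completions` like any text-only request.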
Already have Python 3.12? Use a venv instead (+ one-off generate)
python3 -m venv .venv && source .venv/bin/activate
pip install -U mlx-vlm
python -m mlx_vlm.server --model AEON-7/Gemma-4-12B-it-AEON-Abliterated-MLX-8bit --port 8080
python -m mlx_vlm.generate --model AEON-7/Gemma-4-12B-it-AEON-Abliterated-MLX-8bit \
--prompt "Explain mixed-precision quantization." --max-tokens 512 --temperature 1.0 # add --image pic.jpg for vision
⚡⚡ Optional second deployment — +MTP speculative decoding (~1.1–1.2× faster, output-identical)
Google ships an official Gemma-4 MTP draft — google/gemma-4-12B-it-assistant (423M), an "assistant" head that proposes tokens this model then verifies. Because every token is verified, the output is identical — it's purely a throughput boost. The server auto-pulls the latest draft on first run (gated → run hf auth login once). Use --draft-block-size 2 — the benchmarked sweet spot on this quant; drafting deeper is slower (draft acceptance decays with depth on abliterated/quantized weights).
To run MLX-8bit + MTP, paste this into your terminal:
uv run --python 3.12 --with mlx-vlm -- python -m mlx_vlm.server \
--model AEON-7/Gemma-4-12B-it-AEON-Abliterated-MLX-8bit --port 8080 --max-kv-size 32768 \
--draft-model google/gemma-4-12B-it-assistant --draft-kind mtp --draft-block-size 2
Pre-fetch/refresh the draft explicitly with hf download google/gemma-4-12B-it-assistant. Lossless; +~0.9 GB RAM. Remove the three --draft-* flags to disable. (Measured ~1.1–1.2× on this abliterated build; the draft is tuned for stock Gemma-4, so stock targets see more.)
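To see why verified drafting cannot change the output, here is a toy draft-and-verify loop under greedy decoding. It is a minimal sketch, not the MTP implementation in mlx-vlm: `toy_target` and `toy_draft` are made-up stand-ins for the two models, and the target is called once per token here, whereas a real system checks a whole drafted block in one forward pass (which is where the speedup comes from).

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, block_size: int = 2) -> Tuple[List[int], int]:
    """Draft proposes block_size tokens; the target verifies each one."""
    out = list(prompt)
    goal = len(prompt) + n_new
    accepted = 0
    while len(out) < goal:
        # 1) The draft proposes a short block, conditioning on its own guesses.
        ctx, proposal = list(out), []
        for _ in range(block_size):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each position; its token is always the one kept.
        for tok in proposal:
            verified = target(out)
            out.append(verified)
            if verified != tok:
                break  # first mismatch: discard the rest of the block
            accepted += 1
            if len(out) >= goal:
                break
    return out, accepted


def toy_target(ctx: List[int]) -> int:
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    # Deliberately wrong on every third position to mimic imperfect drafting.
    return guess if len(ctx) % 3 else (guess + 1) % 50


reference = greedy_decode(toy_target, [1, 2, 3], 20)
speculated, n_accepted = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
assert speculated == reference  # identical output regardless of draft quality
```

Because only target-verified tokens are kept, a weaker draft lowers the acceptance count (and the speedup) but never the output, which matches the observation that deeper drafting stops paying off when acceptance decays.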
Need a smaller, faster build for a 16 GB Mac? The high-quality compact MLXFP4 (FP4) build is the sibling.
🖥️ Minimum specs & unified memory
| Spec | MLX-8bit (this build) |
|---|---|
| On disk | 13.4 GB |
| Peak RAM (measured, M4 Pro) | ~13.5 GB text · ~14.0 GB with image |
| Minimum | Apple Silicon (M1 or newer) · 24 GB unified memory |
| Recommended | 32 GB+ for long context + headroom |
On a 16 GB Mac, use the compact MLXFP4 (peaks ~10 GB) instead.
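The sizing guidance above can be written down as a small lookup. This is only a sketch that encodes the figures quoted on this card (24 GB minimum for the 8-bit build, the FP4 sibling for 16 GB machines); the function name and the decision to return `None` below 16 GB are assumptions, since the card gives no guidance for smaller machines.

```python
from typing import Optional

# (build name, minimum unified memory in GB, approximate peak RAM in GB)
BUILDS = [
    ("MLX-8bit", 24, 13.5),
    ("MLXFP4", 16, 10.0),
]


def recommend_build(unified_memory_gb: float) -> Optional[str]:
    """Return the highest-fidelity build whose stated minimum fits."""
    for name, min_gb, _peak_gb in BUILDS:
        if unified_memory_gb >= min_gb:
            return name
    return None  # no guidance on this card for machines below 16 GB
```

For example, `recommend_build(48)` selects the 8-bit build and `recommend_build(16)` selects the FP4 sibling.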
Why 8-bit is the near-lossless point on Apple Silicon
MLX's 4-bit format (mxfp4) is E2M1 — a single mantissa bit — so it visibly diverges from BF16 wherever it is applied (this is the key difference from NVIDIA's NVFP4, which uses finer two-level scaling and is near-lossless at 4-bit). On Apple Silicon, the near-lossless point is mxfp8 (E4M3, 8-bit), which we measured to track BF16 far more tightly than any 4-bit placement we tried:
| Build | top-1 vs BF16 (greedy) | mean KL | median KL | size |
|---|---|---|---|---|
| MLX 8-bit (this) | 0.924 | 0.107 | 0.0019 | ~13.4 GB |
| MLX FP4 (compact sibling) | 0.885 | 0.218 | 0.0042 | 9.3 GB |
| BF16 (source) | 1.000 | 0 | 0 | 23.9 GB |
(Measured on the M4 Pro over the BF16 model's own greedy trajectory — the regime that reflects deployment.)
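For readers who want to reproduce this kind of comparison, the sketch below shows one way to compute top-1 agreement and per-position KL divergence (in nats) from two sets of logits. It assumes the direction KL(BF16 ‖ quantized) and takes plain Python lists of logits per position; it is not the card's actual benchmark script.

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_nats(ref_logits: Sequence[float], test_logits: Sequence[float]) -> float:
    """KL(ref || test) in nats for a single token position."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(xs: Sequence[float]) -> int:
    return max(range(len(xs)), key=xs.__getitem__)


def compare(ref_positions: Sequence[Sequence[float]],
            test_positions: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Aggregate top-1 agreement, mean KL, and median KL over all positions."""
    kls = [kl_nats(r, t) for r, t in zip(ref_positions, test_positions)]
    agree = [argmax(r) == argmax(t) for r, t in zip(ref_positions, test_positions)]
    return {
        "top1": sum(agree) / len(agree),
        "mean_kl": mean(kls),
        "median_kl": median(kls),
    }
```

A mean KL far above the median, as in the table, is what a heavy-tailed distribution looks like: most positions match the reference closely while a small number diverge strongly and dominate the mean.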
Precision map
| Component | Precision | Why |
|---|---|---|
| all attention q/k/v/o_proj + MLP gate/up/down_proj (×48) | mxfp8 (E4M3, 8-bit, group 32) | Near-lossless 8-bit across the whole decoder; safely preserves the K=4 abliteration edit on o_proj/down_proj |
| tied embed_tokens / lm_head | bf16 | Soft-capped logits (final_logit_softcapping=30), kept exact |
| embed_vision / embed_audio / vision_embedder projectors | bf16 | Gemma-4 is encoder-free; all image/audio fidelity lives here, kept exact |
| RMSNorm / QK-norm / value-norm / layer_scalar | bf16 (automatic) | 1-D; never quantized |
328 mxfp8 + 4 bf16-skipped quantizable linears · 8.94 bits/weight.
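The bits-per-weight figure can be sanity-checked with a little arithmetic. The sketch below assumes each group of 32 mxfp8 elements shares one 8-bit scale (as in the OCP microscaling formats), which gives 8.25 bits per quantized weight, and then asks what share of weights would need to stay in bf16 to reach the reported average. The function names are illustrative and the result is an estimate, not a figure from the card.

```python
def mx_bits_per_weight(element_bits: int = 8, group_size: int = 32,
                       scale_bits: int = 8) -> float:
    """Effective bits per weight: element bits plus the amortized shared scale."""
    return element_bits + scale_bits / group_size


def blended_bits(frac_bf16: float, quant_bits: float) -> float:
    """Average bits per weight when a fraction of weights stays in bf16 (16-bit)."""
    return frac_bf16 * 16 + (1 - frac_bf16) * quant_bits


def implied_bf16_fraction(avg_bits: float, quant_bits: float) -> float:
    """Invert blended_bits: fraction of bf16 weights implied by an average."""
    return (avg_bits - quant_bits) / (16 - quant_bits)


quant_bits = mx_bits_per_weight()              # 8 + 8/32
frac = implied_bf16_fraction(8.94, quant_bits)  # share of weights left in bf16
print(f"mxfp8 effective bits: {quant_bits}, implied bf16 share: {frac:.1%}")
```

Under these assumptions roughly nine percent of the weights sit in bf16, which is consistent in spirit with keeping the large tied embedding and the projectors unquantized.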
Validation (MacBook Pro M4 Pro, 48 GB)
| Gate | Result |
|---|---|
| Near-lossless | top-1 0.924 / median KL 0.0019 nats vs BF16 |
| Abliteration survived | 0/8 harmful prompts refused, 0/5 benign prompts refused |
| Coherence (≥512 tok) | no repetition collapse (3/3 clean) |
| Multimodal | image + audio describe correctly (projectors bf16) |
⏱️ Performance — measured on MacBook Pro M4 Pro · 48 GB
All figures below were benchmarked on a MacBook Pro · Apple M4 Pro (14-core CPU, 48 GB unified memory) · macOS 26 · mlx-vlm 0.6.1. Use them as a relative reference for your own Mac: a base M4 / M3 runs somewhat slower, an M4 Max / Ultra notably faster; MLX single-stream throughput is mostly memory-bandwidth bound. This 13.4 GB build wants ≥24 GB unified memory (peaks ~13.5 GB) — on 16 GB Macs use the compact MLXFP4.
| Workload | gen tok/s | prompt tok/s | TTFT | peak RAM |
|---|---|---|---|---|
| Text · 256 tok · single stream | 16.4 (peak 17.1) | 163 | 314 ms | 13.5 GB |
| Image + text · 140 tok | 16.1 | — | — | 14.0 GB |
Greedy, post-warmup, median of 5 runs (benchmark.py).
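To translate these figures into expectations for another machine, a back-of-the-envelope model can help. The sketch below estimates end-to-end latency from TTFT and generation rate, and a bandwidth-bound ceiling on tokens per second under the simplifying assumption that every generated token reads all weights once. The 273 GB/s value is an assumed memory bandwidth for the M4 Pro and is not stated on this card; substitute the figure for your own chip.

```python
def generation_seconds(n_tokens: int, gen_tok_s: float, ttft_s: float) -> float:
    """Time to first token, then the remaining tokens at the steady decode rate."""
    return ttft_s + (n_tokens - 1) / gen_tok_s


def bandwidth_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode rate if each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


# Figures from the table above: 256 tokens, 16.4 tok/s, 314 ms TTFT.
latency = generation_seconds(256, gen_tok_s=16.4, ttft_s=0.314)

# Assumption: 273 GB/s memory bandwidth; 13.4 GB of weights from this card.
ceiling = bandwidth_ceiling_tok_s(13.4, bandwidth_gb_s=273.0)
efficiency = 16.4 / ceiling

print(f"{latency:.1f} s for 256 tokens; ceiling {ceiling:.1f} tok/s; "
      f"measured rate is {efficiency:.0%} of that ceiling")
```

The ceiling ignores KV-cache reads, activations, and compute, so real throughput lands below it; the ratio is mainly useful for scaling the measured rate to chips with more or less bandwidth.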
Inherited K=4 abliteration capability profile vs google/gemma-4-12B-it: wikitext PPL drift −4.22%, HumanEval functional +6.7pp, IFEval 90% (see BF16 source card).
Container & toolkit
ghcr.io/aeon-7/gemma4-aeon-abliterated-mlx-toolkit — the reproducible quant + validation + serve pipeline, an AGENTS.md agent-setup guide, and an elaborate model-comparison card. Quickstart is at the top of this page; on macOS run host-native for Metal (Docker has no Metal passthrough — see the toolkit's notes).
Technical details
| Property | Value |
|---|---|
| Base | google/gemma-4-12B-it → AEON K=4 biprojection (BF16) → MLX 8-bit |
| Architecture | Gemma4UnifiedForConditionalGeneration (text + image + audio) |
| Decoder layers | 48 · Hidden 3840 · GQA 16 heads / 8 KV · hybrid sliding(1024)+full |
| Vocab | 262,144 · Embeddings tied · final_logit_softcapping=30 |
| Quant | mxfp8 (8-bit, group 32) all linears; bf16 head + vision/audio projectors |
| Tooling | mlx 0.31.2 / mlx-vlm 0.6.1 |
| Footprint | ~13.4 GB |
Provenance
- Source (BF16): AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16: K=4 multi-direction norm-preserving biprojection; edits o_proj + down_proj on 24/48 layers; basis from top-SNR layers L24/L37/L39/L26.
- Original base: google/gemma-4-12B-it by Google.
- Quantized by AEON-7 on Apple Silicon (MacBook Pro M4 Pro, 48 GB) with mlx-vlm. Recipe designed + adversarially validated with AI-engineering assistance from Anthropic.
Arbitration Clause
By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:
Sole Responsibility. You, the user, are solely and exclusively responsible for (a) every prompt you or your downstream system issue to this model, (b) every response this model produces in reply, (c) every downstream action taken by you, your systems, your agents, or your users in reliance on those responses, and (d) any harm — direct, indirect, consequential, foreseeable, or otherwise — that results from any of the above.
No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.
Legal Compliance. You are responsible for ensuring that your use of this model complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.
Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.
Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater, not lesser, caution, forethought, and ethical discipline when operating this model than you would when operating a base aligned model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.
No Endorsement of Outputs. The authors, contributors, and publishers of this model do not endorse, adopt, or take responsibility for any specific output this model produces. Outputs are a stochastic function of the prompt, the weights, and the sampler state — not a statement of position by any human.
Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.
Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.
Severability. If any provision of this clause is held unenforceable in a given jurisdiction, the remaining provisions remain in full force in that jurisdiction, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.
Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.
This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.
License
Inherits the Gemma license from the base model. By using this model you agree to Google's Gemma license terms.
☕ Support the work
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
- ₿ Bitcoin (BTC): bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
- Ξ Ethereum (ETH): 0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
- ◎ Solana (SOL): DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
- ⓜ Monero (XMR): 836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.