TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit

4-bit precision

TxemAI committed on Apr 6

Commit 92bf2ad · verified · 1 parent: 46270a9

Update README.md

Files changed (1)

README.md +93 -3

@@ -1,7 +1,97 @@
 ---
-language: en
-library_name: mlx
-pipeline_tag: image-text-to-text
 tags:
 - mlx
 ---

 ---
+license: apache-2.0
 tags:
 - mlx
+- gemma4
+- 4-bit
+- apple-silicon
+library_name: mlx-vlm
+base_model: llmfan46/gemma-4-31B-it-uncensored-heretic
 ---
+# gemma-4-31B-uncensored-heretic · MLX 4-bit
+MLX conversion of [llmfan46/gemma-4-31B-it-uncensored-heretic](https://huggingface.co/llmfan46/gemma-4-31B-it-uncensored-heretic), a fine-tune of Google's Gemma 4 31B Instruct. Quantized to **~7.4 bits per weight** using mlx-vlm v0.4.3 on Apple Silicon.
+If you have enough RAM, the [Q8 version](https://huggingface.co/TxemAI/gemma-4-31B-uncensored-heretic-mlx-8bit) offers near-lossless quality.
+## Performance on Apple M4 Max · 128 GB
+- Peak memory: **~29 GB**
+- Prompt throughput: **~39.9 tok/s**
+- Generation speed: **~16.9 tok/s**
+## Requirements
+```bash
+pip install -U mlx-vlm
+```
+> Gemma 4 support requires `mlx-vlm >= 0.4.3`. Standard `mlx-lm` does not yet support the `gemma4` architecture.
+## Usage
+**Text only**
+```bash
+python -m mlx_vlm generate \
+  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
+  --prompt "Your prompt here" \
+  --max-tokens 512
+```
+**With image**
+```bash
+python -m mlx_vlm generate \
+  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
+  --prompt "Describe this image." \
+  --image path/to/image.jpg \
+  --max-tokens 512
+```
+**Python API**
+```python
+from mlx_vlm import load, generate
+model, processor = load("TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit")
+response = generate(
+    model,
+    processor,
+    prompt="Your prompt here",
+    max_tokens=512,
+    temperature=0.7,
+)
+print(response)
+```
+## Which version should I use?
+| Precision | Peak RAM | Generation speed | Quality |
+|---|---|---|---|
+| BF16 (full) | ~62 GB | slowest | reference |
+| Q8 | ~34 GB | ~14.5 tok/s | near-lossless |
+| **Q4 (this model)** | **~29 GB** | **~16.9 tok/s** | good |
+Q4 is the recommended version for machines with 32 GB of unified memory (M2/M3 Pro, M1 Max, M3 Max). With a peak of ~29 GB, expect little headroom for other applications.
+## Notes
+- The model activates Gemma 4's **thinking channel** (`<|channel>thought`) on reasoning-heavy prompts — this is expected behaviour.
+- The mel filter warning on load is harmless; it relates to the audio encoder and does not affect text or vision inference.
+- Unofficial community conversion. For the original fine-tune see [llmfan46/gemma-4-31B-it-uncensored-heretic](https://huggingface.co/llmfan46/gemma-4-31B-it-uncensored-heretic).
+## Conversion
+```bash
+python -m mlx_vlm convert \
+  --hf-path llmfan46/gemma-4-31B-it-uncensored-heretic \
+  --mlx-path ./gemma-4-31B-uncensored-heretic-mlx-4bit \
+  --quantize --q-bits 4
+```
+## Credits
+- **Google DeepMind** — Gemma 4 base model
+- **llmfan46** — uncensored-heretic fine-tune
+- **ml-explore** — MLX framework
+- **Blaizzy** — mlx-vlm library
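
The "Which version should I use?" table added by this commit amounts to a lookup by available memory. The following is a minimal sketch of that lookup, assuming the approximate peak-RAM figures from the table. The names `VARIANTS`, `pick_variant`, and `headroom_gb` are hypothetical and are not part of mlx-vlm; the headroom allowance in particular is an assumption, not a figure from the README.

```python
from typing import Optional

# Approximate peak RAM in GB, copied from the README's comparison table.
# Ordered from highest to lowest quality.
VARIANTS = [
    ("BF16", 62.0),
    ("Q8", 34.0),
    ("Q4", 29.0),
]


def pick_variant(unified_memory_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-quality variant whose peak RAM fits, or None.

    headroom_gb is a hypothetical allowance for the OS and other
    applications; the README does not specify one.
    """
    budget = unified_memory_gb - headroom_gb
    for label, peak_gb in VARIANTS:
        if peak_gb <= budget:
            return label
    return None


if __name__ == "__main__":
    for ram in (24, 32, 36, 64, 128):
        print(f"{ram} GB -> {pick_variant(ram)}")
```

With no headroom, a 32 GB machine maps to Q4, which matches the README's recommendation. Reserving even 4 GB of headroom (`pick_variant(32, headroom_gb=4)`) leaves no variant that fits, so the figures are best read as a lower bound on required memory.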