TxemAI committed on
Commit 92bf2ad · verified · 1 Parent(s): 46270a9

Update README.md

Files changed (1)
  1. README.md +93 -3
README.md CHANGED
@@ -1,7 +1,97 @@
  ---
- language: en
- library_name: mlx
- pipeline_tag: image-text-to-text
+ license: apache-2.0
  tags:
  - mlx
+ - gemma4
+ - 4-bit
+ - apple-silicon
+ library_name: mlx-vlm
+ base_model: llmfan46/gemma-4-31B-it-uncensored-heretic
  ---
+
+ # gemma-4-31B-uncensored-heretic · MLX 4-bit
+
+ MLX conversion of [llmfan46/gemma-4-31B-it-uncensored-heretic](https://huggingface.co/llmfan46/gemma-4-31B-it-uncensored-heretic), a fine-tune of Google's Gemma 4 31B Instruct. Quantized with `--q-bits 4` using mlx-vlm v0.4.3 on Apple Silicon; the converted weights come to **~7.4 bits per weight**.
+
+ If you have enough RAM, the [Q8 version](https://huggingface.co/TxemAI/gemma-4-31B-uncensored-heretic-mlx-8bit) offers near-lossless quality.
+
+ ## Performance on Apple M4 Max · 128 GB
+
+ - Peak memory: **~29 GB**
+ - Prompt throughput: **~39.9 tok/s**
+ - Generation speed: **~16.9 tok/s**
+
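For a rough sense of what these throughput figures mean in practice, the sketch below turns them into a wall-clock estimate. It assumes prompt processing and generation run back to back at the rates quoted above, and it ignores model load time and any slowdown at long context. `estimate_seconds` is a hypothetical helper written for illustration, not part of mlx-vlm.

```python
# Rates quoted in the Performance section (Apple M4 Max, 128 GB).
PROMPT_TOKENS_PER_S = 39.9
GENERATION_TOKENS_PER_S = 16.9


def estimate_seconds(prompt_tokens, new_tokens,
                     prompt_tps=PROMPT_TOKENS_PER_S,
                     gen_tps=GENERATION_TOKENS_PER_S):
    """Estimate total time as prompt-processing time plus generation time."""
    return prompt_tokens / prompt_tps + new_tokens / gen_tps


# Example: a 200-token prompt followed by a 512-token reply.
print(f"{estimate_seconds(200, 512):.1f} s")
```

Under these assumptions, generation dominates: 512 new tokens alone take roughly 30 seconds at ~16.9 tok/s.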
+ ## Requirements
+
+ ```bash
+ pip install -U mlx-vlm
+ ```
+
+ > Gemma 4 support requires `mlx-vlm >= 0.4.3`. Standard `mlx-lm` does not yet support the `gemma4` architecture.
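One way to confirm the installed version before loading the model is sketched below. It uses only the standard library (`importlib.metadata`) and a deliberately simple version comparison that looks at the leading numeric parts only, so a pre-release such as `0.4.3rc1` is treated like `0.4.3`. The helper names are hypothetical.

```python
from importlib import metadata

MIN_VERSION = "0.4.3"  # minimum stated in the note above


def version_tuple(version):
    """Turn '0.4.3' into (0, 4, 3); anything after the leading digits of a part is ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_VERSION):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_mlx_vlm():
    """Report whether the installed mlx-vlm distribution is new enough."""
    try:
        installed = metadata.version("mlx-vlm")
    except metadata.PackageNotFoundError:
        return "mlx-vlm is not installed"
    if meets_minimum(installed):
        return f"mlx-vlm {installed} is new enough"
    return f"mlx-vlm {installed} is older than {MIN_VERSION}; run: pip install -U mlx-vlm"


if __name__ == "__main__":
    print(check_mlx_vlm())
```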
+
+ ## Usage
+
+ **Text only**
+ ```bash
+ python -m mlx_vlm generate \
+   --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
+   --prompt "Your prompt here" \
+   --max-tokens 512
+ ```
+
+ **With image**
+ ```bash
+ python -m mlx_vlm generate \
+   --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
+   --prompt "Describe this image." \
+   --image path/to/image.jpg \
+   --max-tokens 512
+ ```
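To script the two commands above from Python, a small wrapper can assemble the same argument list. This is a sketch under the assumption that the CLI is invoked exactly as shown in this README (`python -m mlx_vlm generate` with `--model`, `--prompt`, `--image` and `--max-tokens`); `build_generate_command` is a hypothetical helper, and the example only prints the command instead of running it.

```python
import shlex
import sys

MODEL = "TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit"


def build_generate_command(prompt, image=None, max_tokens=512, model=MODEL):
    """Build the argv list for the generate command shown above."""
    cmd = [sys.executable, "-m", "mlx_vlm", "generate",
           "--model", model,
           "--prompt", prompt]
    if image is not None:
        # The image flag is added only for vision prompts.
        cmd += ["--image", str(image)]
    cmd += ["--max-tokens", str(max_tokens)]
    return cmd


# Print the command that would be run for the image example.
print(shlex.join(build_generate_command("Describe this image.",
                                        image="path/to/image.jpg")))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it; that step needs mlx-vlm installed and downloads the model on first use.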
+
+ **Python API**
+ ```python
+ from mlx_vlm import load, generate
+
+ model, processor = load("TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit")
+
+ response = generate(
+     model,
+     processor,
+     prompt="Your prompt here",
+     max_tokens=512,
+     temperature=0.7,
+ )
+ print(response)
+ ```
+
+ ## Which version should I use?
+
+ | Precision | Peak RAM | Gen speed | Quality |
+ |---|---|---|---|
+ | BF16 (full) | ~62 GB | slowest | reference |
+ | Q8 | ~34 GB | ~14.5 tok/s | near-lossless |
+ | **Q4 (this model)** | **~29 GB** | **~16.9 tok/s** | good |
+
+ Q4 is the recommended version for machines with 32 to 36 GB of unified memory (for example M2 Pro or M1 Max at 32 GB, M3 Pro or M3 Max at 36 GB). With a peak of ~29 GB, headroom on a 32 GB machine is tight.
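The table can also be read as a simple lookup. The sketch below picks the highest-precision variant whose peak memory fits a given machine; the peak figures are copied from the table, while the 3 GB headroom default and the helper name `pick_precision` are assumptions made for illustration, not measured requirements.

```python
# Peak-memory figures copied from the table above, in GB.
PEAK_GB = {"BF16": 62.0, "Q8": 34.0, "Q4": 29.0}


def pick_precision(unified_memory_gb, headroom_gb=3.0):
    """Return the highest-precision variant whose peak fits, or None if none does."""
    budget = unified_memory_gb - headroom_gb
    for name in ("BF16", "Q8", "Q4"):  # highest quality first
        if PEAK_GB[name] <= budget:
            return name
    return None


for ram in (24, 32, 48, 128):
    print(f"{ram} GB -> {pick_precision(ram)}")
```

With the default headroom, a 32 GB machine lands on Q4 with nothing to spare, which matches the recommendation above; raise `headroom_gb` if other applications need memory at the same time.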
+
+ ## Notes
+
+ - The model activates Gemma 4's **thinking channel** (`<|channel>thought`) on reasoning-heavy prompts; this is expected behaviour.
+ - The mel filter warning on load is harmless; it relates to the audio encoder and does not affect text or vision inference.
+ - Unofficial community conversion. For the original fine-tune see [llmfan46/gemma-4-31B-it-uncensored-heretic](https://huggingface.co/llmfan46/gemma-4-31B-it-uncensored-heretic).
+
+ ## Conversion
+
+ ```bash
+ python -m mlx_vlm convert \
+   --hf-path llmfan46/gemma-4-31B-it-uncensored-heretic \
+   --mlx-path ./gemma-4-31B-uncensored-heretic-mlx-4bit \
+   --quantize --q-bits 4
+ ```
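As a rough sanity check on the figures quoted in this README, weight storage can be estimated as parameters × bits per weight ÷ 8. The sketch assumes the "31B" in the model name is the parameter count and counts weights only, so activations and the KV cache are left out.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 31e9  # assumption: "31B" in the model name taken as the parameter count

print(f"BF16 (16 bits per weight):       {weight_memory_gb(PARAMS, 16):.1f} GB")
print(f"This model (7.4 bits per weight): {weight_memory_gb(PARAMS, 7.4):.1f} GB")
```

Under that assumption the estimates come to about 62 GB for BF16 and about 28.7 GB for this conversion, in line with the ~62 GB and ~29 GB peak-memory figures reported in the README.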
+
+ ## Credits
+
+ - **Google DeepMind**: Gemma 4 base model
+ - **llmfan46**: uncensored-heretic fine-tune
+ - **ml-explore**: MLX framework
+ - **Blaizzy**: mlx-vlm library