Pranavz committed on
Commit
d7354be
·
verified ·
1 Parent(s): 82b81e0

Upload README.md with huggingface_hub

  1. README.md +448 -102
README.md CHANGED
@@ -1,199 +1,545 @@
1
  ---
2
  library_name: transformers
3
- tags: []
 
 
 
 
 
 
 
 
 
4
  ---
 
5
 
6
- # Model Card for Model ID
7
 
8
- <!-- Provide a quick summary of what the model is/does. -->
 
 
 
 
 
 
 
 
 
 
9
 
 
10
 
 
 
 
 
11
 
12
- ## Model Details
13
 
14
- ### Model Description
15
 
16
- <!-- Provide a longer summary of what this model is. -->
 
 
17
 
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
 
27
 
28
- ### Model Sources [optional]
29
 
30
- <!-- Provide the basic links for the model. -->
31
 
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
 
36
- ## Uses
37
 
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
 
40
- ### Direct Use
41
 
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
 
44
- [More Information Needed]
45
 
46
- ### Downstream Use [optional]
47
 
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
 
50
- [More Information Needed]
51
 
52
- ### Out-of-Scope Use
53
 
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
 
56
- [More Information Needed]
57
 
58
- ## Bias, Risks, and Limitations
 
 
 
 
 
 
 
 
 
59
 
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
 
62
- [More Information Needed]
63
 
64
- ### Recommendations
 
 
 
 
 
 
 
 
 
 
65
 
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
 
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
 
70
- ## How to Get Started with the Model
71
 
72
- Use the code below to get started with the model.
73
 
74
- [More Information Needed]
75
 
76
- ## Training Details
77
 
78
- ### Training Data
 
 
 
 
 
 
 
 
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
 
82
- [More Information Needed]
83
 
84
- ### Training Procedure
85
 
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
 
88
- #### Preprocessing [optional]
 
 
 
 
 
 
 
 
 
 
 
 
89
 
90
- [More Information Needed]
91
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
92
 
93
- #### Training Hyperparameters
94
 
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
 
97
- #### Speeds, Sizes, Times [optional]
 
 
 
 
 
 
 
 
 
 
98
 
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
 
 
 
 
 
 
 
100
 
101
- [More Information Needed]
 
 
102
 
103
- ## Evaluation
 
 
104
 
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
 
107
- ### Testing Data, Factors & Metrics
 
 
 
 
 
 
108
 
109
- #### Testing Data
110
 
111
- <!-- This should link to a Dataset Card if possible. -->
 
 
 
112
 
113
- [More Information Needed]
 
 
 
 
 
 
 
114
 
115
- #### Factors
116
 
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
 
119
- [More Information Needed]
 
 
 
 
 
 
 
 
 
120
 
121
- #### Metrics
 
 
 
 
 
 
 
 
122
 
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
 
124
 
125
- [More Information Needed]
 
 
126
 
127
- ### Results
128
 
129
- [More Information Needed]
130
 
131
- #### Summary
 
132
 
 
133
 
 
134
 
135
- ## Model Examination [optional]
136
 
137
- <!-- Relevant interpretability work for the model goes here -->
 
138
 
139
- [More Information Needed]
140
 
141
- ## Environmental Impact
 
 
 
 
 
 
 
142
 
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
 
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
 
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
 
 
 
 
 
 
152
 
153
- ## Technical Specifications [optional]
 
 
 
 
 
 
 
 
154
 
155
- ### Model Architecture and Objective
 
 
156
 
157
- [More Information Needed]
 
 
158
 
159
- ### Compute Infrastructure
160
 
161
- [More Information Needed]
162
 
163
- #### Hardware
164
 
165
- [More Information Needed]
166
 
167
- #### Software
168
 
169
- [More Information Needed]
170
 
171
- ## Citation [optional]
 
 
172
 
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
 
175
- **BibTeX:**
176
 
177
- [More Information Needed]
 
 
 
 
178
 
179
- **APA:**
 
180
 
181
- [More Information Needed]
182
 
183
- ## Glossary [optional]
184
 
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
 
187
- [More Information Needed]
188
 
189
- ## More Information [optional]
190
 
191
- [More Information Needed]
192
 
193
- ## Model Card Authors [optional]
 
 
194
 
195
- [More Information Needed]
196
 
197
- ## Model Card Contact
198
 
199
- [More Information Needed]
1
  ---
2
  library_name: transformers
3
+ license: apache-2.0
4
+ license_link: https://ai.google.dev/gemma/docs/gemma_4_license
5
+ pipeline_tag: image-text-to-text
6
+ base_model:
7
+ - google/gemma-4-26B-A4B
8
+ tags:
9
+ - heretic
10
+ - uncensored
11
+ - decensored
12
+ - abliterated
13
  ---
14
+ # This is a decensored version of [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it), made using [Heretic](https://github.com/p-e-w/heretic) v1.2.0
15
 
16
+ ## Abliteration parameters
17
 
18
+ | Parameter | Value |
19
+ | :-------- | :---: |
20
+ | **direction_index** | per layer |
21
+ | **attn.o_proj.max_weight** | 3.75 |
22
+ | **attn.o_proj.max_weight_position** | 19.71 |
23
+ | **attn.o_proj.min_weight** | 3.06 |
24
+ | **attn.o_proj.min_weight_distance** | 6.26 |
25
+ | **mlp.down_proj.max_weight** | 2.66 |
26
+ | **mlp.down_proj.max_weight_position** | 10.85 |
27
+ | **mlp.down_proj.min_weight** | 3.73 |
28
+ | **mlp.down_proj.min_weight_distance** | 19.15 |
29
 
30
+ ## Performance
31
 
32
+ | Metric | This model | Original model ([google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it)) |
33
+ | :----- | :--------: | :---------------------------: |
34
+ | **KL divergence** | 0.2474 | 0 *(by definition)* |
35
+ | **Refusals** | 4/100 | 100/100 |
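For readers unfamiliar with the metric: KL divergence measures how far one probability distribution has moved from a reference distribution, and it is zero when the two are identical, which is why the original model scores 0 against itself. The sketch below computes it for two small, made-up next-token distributions. The numbers and the direction of the comparison are illustrative assumptions; this card does not describe how Heretic gathers the distributions it compares.

```python
import math

def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token probabilities over a 3-token vocabulary.
reference = [0.7, 0.2, 0.1]  # stands in for the original model
modified = [0.6, 0.3, 0.1]   # stands in for the modified model

print(kl_divergence(reference, reference))  # identical distributions
print(kl_divergence(reference, modified))   # small positive value
```

A lower value means the modified model's output distribution stays closer to the original's on the prompts being compared.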
36
 
37
+ -----
38
 
 
39
 
40
+ <div align="center">
41
+ <img src="https://ai.google.dev/gemma/images/gemma4_banner.png">
42
+ </div>
43
 
 
44
 
45
+ <p align="center">
46
+ <a href="https://huggingface.co/collections/google/gemma-4" target="_blank">Hugging Face</a> |
47
+ <a href="https://github.com/google-gemma" target="_blank">GitHub</a> |
48
+ <a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank">Launch Blog</a> |
49
+ <a href="https://ai.google.dev/gemma/docs/core" target="_blank">Documentation</a>
50
+ <br>
51
+ <b>License</b>: <a href="https://ai.google.dev/gemma/docs/gemma_4_license" target="_blank">Apache 2.0</a> | <b>Authors</b>: <a href="https://deepmind.google/models/gemma/" target="_blank">Google DeepMind</a>
52
+ </p>
53
 
54
+ Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.
55
 
56
+ Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.
57
 
58
+ Gemma 4 introduces key **capability and architectural advancements**:
 
 
59
 
60
+ * **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
61
 
62
+ * **Extended Multimodalities** – Processes text, images with variable aspect ratio and resolution (all models), video, and audio (natively on the E2B and E4B models).
63
 
64
+ * **Diverse & Efficient Architectures** – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
65
 
66
+ * **Optimized for On-Device** – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
67
 
68
+ * **Increased Context Window** – The small models feature a 128K context window, while the medium models support 256K.
69
 
70
+ * **Enhanced Coding & Agentic Capabilities** – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
71
 
72
+ * **Native System Prompt Support** – Gemma 4 introduces native support for the `system` role, enabling more structured and controllable conversations.
73
 
74
+ ## **Models Overview**
75
 
76
+ Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.
77
 
78
+ The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).
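To make the interleaving concrete, here is a minimal sketch of a layer schedule and a causal visibility check. The 5-local-to-1-global ratio and both function names are assumptions chosen for illustration; this card does not state the actual ratio. The only properties taken from the text above are that local sliding-window layers are interleaved with global layers and that the final layer is always global.

```python
def layer_schedule(num_layers, locals_per_global=5):
    """Label each layer 'local' or 'global'; the ratio is an illustrative assumption."""
    kinds = []
    for i in range(num_layers):
        is_global = (i + 1) % (locals_per_global + 1) == 0
        kinds.append("global" if is_global else "local")
    kinds[-1] = "global"  # the final layer is always global
    return kinds

def can_attend(query_pos, key_pos, kind, window):
    """Causal visibility: global layers see every earlier token,
    local layers only the most recent `window` positions."""
    if key_pos > query_pos:
        return False  # never attend to future tokens
    if kind == "global":
        return True
    return query_pos - key_pos < window

schedule = layer_schedule(30)
print(schedule.count("local"), schedule.count("global"))
print(can_attend(2000, 100, "local", 1024), can_attend(2000, 100, "global", 1024))
```

Local layers keep the per-token cost bounded by the window size, while the occasional global layer lets information from far earlier in the context reach the current token.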
79
 
80
+ ### Dense Models
81
 
82
+ | Property | E2B | E4B | 31B Dense |
83
+ | :---- | :---- | :---- | :---- |
84
+ | **Total Parameters** | 2.3B effective (5.1B with embeddings) | 4.5B effective (8B with embeddings) | 30.7B |
85
+ | **Layers** | 35 | 42 | 60 |
86
+ | **Sliding Window** | 512 tokens | 512 tokens | 1024 tokens |
87
+ | **Context Length** | 128K tokens | 128K tokens | 256K tokens |
88
+ | **Vocabulary Size** | 262K | 262K | 262K |
89
+ | **Supported Modalities** | Text, Image, Audio | Text, Image, Audio | Text, Image |
90
+ | **Vision Encoder Parameters** | *~150M* | *~150M* | *~550M* |
91
+ | **Audio Encoder Parameters** | *~300M* | *~300M* | No Audio |
92
 
93
+ The "E" in E2B and E4B stands for "effective" parameters. The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.
94
 
95
+ ### Mixture-of-Experts (MoE) Model
96
 
97
+ | Property | 26B A4B MoE |
98
+ | :---- | :---- |
99
+ | **Total Parameters** | 25.2B |
100
+ | **Active Parameters** | 3.8B |
101
+ | **Layers** | 30 |
102
+ | **Sliding Window** | 1024 tokens |
103
+ | **Context Length** | 256K tokens |
104
+ | **Vocabulary Size** | 262K |
105
+ | **Expert Count** | 8 active / 128 total and 1 shared |
106
+ | **Supported Modalities** | Text, Image |
107
+ | **Vision Encoder Parameters** | *~550M* |
108
 
109
+ The "A" in 26B A4B stands for "active parameters", as opposed to the total number of parameters the model contains. Because only about 4B parameters (3.8B) are active during inference, the Mixture-of-Experts model runs much faster than its 26B total might suggest. This makes it a strong choice over the dense 31B model when inference speed matters, since it runs almost as fast as a 4B-parameter model.
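As a quick check on the figures in the table above, the share of parameters that is active can be computed directly. This is plain arithmetic on the two table values, not a measurement of speed.

```python
total_params_b = 25.2   # total parameters, in billions (from the table above)
active_params_b = 3.8   # active parameters, in billions (from the table above)

active_fraction = active_params_b / total_params_b
print(f"Active share of parameters: {active_fraction:.1%}")
```

Note that all experts still have to be held in memory, so the memory footprint follows the total parameter count rather than the active one.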
110
 
111
+ ## **Benchmark Results**
112
 
113
+ These models were evaluated on a broad collection of datasets and metrics covering different aspects of text generation. The results in the table below are for the instruction-tuned models.
114
 
115
+ | | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B (no think) |
116
+ | :---- | :---- | :---- | :---- | :---- | :---- |
117
+ | MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
118
+ | AIME 2026 no tools | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
119
+ | LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
120
+ | Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
121
+ | GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
122
+ | Tau2 (average over 3) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2% |
123
+ | HLE no tools | 19.5% | 8.7% | - | - | - |
124
+ | HLE with search | 26.5% | 17.2% | - | - | - |
125
+ | BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
126
+ | MMMLU | 88.4% | 86.3% | 76.6% | 67.4% | 70.7% |
127
+ | **Vision** | | | | | |
128
+ | MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
129
+ | OmniDocBench 1.5 (average edit distance, lower is better) | 0.131 | 0.149 | 0.181 | 0.290 | 0.365 |
130
+ | MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
131
+ | MedXPertQA MM | 61.3% | 58.1% | 28.7% | 23.5% | - |
132
+ | **Audio** | | | | | |
133
+ | CoVoST | - | - | 35.54 | 33.47 | - |
134
+ | FLEURS (lower is better) | - | - | 0.08 | 0.09 | - |
135
+ | **Long Context** | | | | | |
136
+ | MRCR v2 8 needle 128k (average) | 66.4% | 44.1% | 25.4% | 19.1% | 13.5% |
137
 
138
+ ## **Core Capabilities**
139
 
140
+ Gemma 4 models handle a broad range of tasks across text, vision, and audio. Key capabilities include:
141
 
142
+ * **Thinking** – Built-in reasoning mode that lets the model think step-by-step before answering.
143
+ * **Long Context** – Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B).
144
+ * **Image Understanding** – Object detection, Document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
145
+ * **Video Understanding** – Analyze video by processing sequences of frames.
146
+ * **Interleaved Multimodal Input** – Freely mix text and images in any order within a single prompt.
147
+ * **Function Calling** – Native support for structured tool use, enabling agentic workflows.
148
+ * **Coding** – Code generation, completion, and correction.
149
+ * **Multilingual** – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
150
+ * **Audio** (E2B and E4B only) – Automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.
151
 
152
+ ## Getting Started
153
 
154
+ You can use all Gemma 4 models with the latest version of Transformers. To get started, install the necessary dependencies in your environment:
155
 
156
+ `pip install -U transformers torch accelerate`
157
 
158
+ Once you have everything installed, you can proceed to load the model with the code below:
159
 
160
+ ```python
161
+ from transformers import AutoProcessor, AutoModelForCausalLM
162
+
163
+ MODEL_ID = "google/gemma-4-26B-A4B-it"
164
+
165
+ # Load model
166
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
167
+ model = AutoModelForCausalLM.from_pretrained(
168
+     MODEL_ID,
169
+     dtype="auto",
170
+     device_map="auto"
171
+ )
172
+ ```
173
 
174
+ Once the model is loaded, you can start generating output:
175
+
176
+ ```python
177
+ # Prompt
178
+ messages = [
179
+     {"role": "system", "content": "You are a helpful assistant."},
180
+     {"role": "user", "content": "Write a short joke about saving RAM."},
181
+ ]
182
+
183
+ # Process input
184
+ text = processor.apply_chat_template(
185
+     messages,
186
+     tokenize=False,
187
+     add_generation_prompt=True,
188
+     enable_thinking=False
189
+ )
190
+ inputs = processor(text=text, return_tensors="pt").to(model.device)
191
+ input_len = inputs["input_ids"].shape[-1]
192
 
193
+ # Generate output
194
+ outputs = model.generate(**inputs, max_new_tokens=1024)
195
+ response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
196
+
197
+ # Parse output
198
+ processor.parse_response(response)
199
+ ```
200
+
201
+ To enable reasoning, set `enable_thinking=True` and the `parse_response` function will take care of parsing the thinking output.
202
+
203
+ Below, you will also find snippets for processing audio (E2B and E4B only), images, and video alongside text:
204
+
205
+ <details>
206
+ <summary>Code for processing Audio</summary>
207
+
208
+ Instead of using `AutoModelForCausalLM`, you can use `AutoModelForMultimodalLM` to process audio. To use it, make sure to install the following packages:
209
+
210
+
211
+ `pip install -U transformers torch librosa accelerate`
212
+
213
+ You can then load the model with the code below:
214
+
215
+ ```python
216
+ from transformers import AutoProcessor, AutoModelForMultimodalLM
217
+
218
+ MODEL_ID = "google/gemma-4-E2B-it"
219
+
220
+ # Load model
221
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
222
+ model = AutoModelForMultimodalLM.from_pretrained(
223
+     MODEL_ID,
224
+     dtype="auto",
225
+     device_map="auto"
226
+ )
227
+ ```
228
 
229
+ Once the model is loaded, you can start generating output by directly referencing the audio URL in the prompt:
230
 
 
231
 
232
+ ```python
233
+ # Prompt - add audio before text
234
+ messages = [
235
+     {
236
+         "role": "user",
237
+         "content": [
238
+             {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/journal1.wav"},
239
+             {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
240
+         ]
241
+     }
242
+ ]
243
 
244
+ # Process input
245
+ inputs = processor.apply_chat_template(
246
+     messages,
247
+     tokenize=True,
248
+     return_dict=True,
249
+     return_tensors="pt",
250
+     add_generation_prompt=True,
251
+ ).to(model.device)
252
+ input_len = inputs["input_ids"].shape[-1]
253
 
254
+ # Generate output
255
+ outputs = model.generate(**inputs, max_new_tokens=512)
256
+ response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
257
 
258
+ # Parse output
259
+ processor.parse_response(response)
260
+ ```
261
 
262
+ </details>
263
 
264
+ <details>
265
+ <summary>Code for processing Images</summary>
266
+
267
+ Instead of using `AutoModelForCausalLM`, you can use `AutoModelForMultimodalLM` to process images. To use it, make sure to install the following packages:
268
+
269
+
270
+ `pip install -U transformers torch torchvision accelerate`
271
 
272
+ You can then load the model with the code below:
273
 
274
+ ```python
275
+ from transformers import AutoProcessor, AutoModelForMultimodalLM
276
+
277
+ MODEL_ID = "google/gemma-4-26B-A4B-it"
278
 
279
+ # Load model
280
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
281
+ model = AutoModelForMultimodalLM.from_pretrained(
282
+     MODEL_ID,
283
+     dtype="auto",
284
+     device_map="auto"
285
+ )
286
+ ```
287
 
288
+ Once the model is loaded, you can start generating output by directly referencing the image URL in the prompt:
289
 
 
290
 
291
+ ```python
292
+ # Prompt - add image before text
293
+ messages = [
294
+     {
295
+         "role": "user", "content": [
296
+             {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
297
+             {"type": "text", "text": "What is shown in this image?"}
298
+         ]
299
+     }
300
+ ]
301
 
302
+ # Process input
303
+ inputs = processor.apply_chat_template(
304
+     messages,
305
+     tokenize=True,
306
+     return_dict=True,
307
+     return_tensors="pt",
308
+     add_generation_prompt=True,
309
+ ).to(model.device)
310
+ input_len = inputs["input_ids"].shape[-1]
311
 
312
+ # Generate output
313
+ outputs = model.generate(**inputs, max_new_tokens=512)
314
+ response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
315
 
316
+ # Parse output
317
+ processor.parse_response(response)
318
+ ```
319
 
320
+ </details>
321
 
 
322
 
323
+ <details>
324
+ <summary>Code for processing Videos</summary>
325
 
326
+ Instead of using `AutoModelForCausalLM`, you can use `AutoModelForMultimodalLM` to process videos. To use it, make sure to install the following packages:
327
 
328
+ `pip install -U transformers torch torchvision torchcodec librosa accelerate`
329
 
330
+ You can then load the model with the code below:
331
 
332
+ ```python
333
+ from transformers import AutoProcessor, AutoModelForMultimodalLM
334
 
335
+ MODEL_ID = "google/gemma-4-26B-A4B-it"
336
 
337
+ # Load model
338
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
339
+ model = AutoModelForMultimodalLM.from_pretrained(
340
+     MODEL_ID,
341
+     dtype="auto",
342
+     device_map="auto"
343
+ )
344
+ ```
345
 
346
+ Once the model is loaded, you can start generating output by directly referencing the video URL in the prompt:
347
 
 
348
 
349
+ ```python
350
+ # Prompt - add video before text
351
+ messages = [
352
+     {
353
+         "role": "user",
354
+         "content": [
355
+             {"type": "video", "video": "https://github.com/bebechien/gemma/raw/refs/heads/main/videos/ForBiggerBlazes.mp4"},
356
+             {"type": "text", "text": "Describe this video."}
357
+         ]
358
+     }
359
+ ]
360
 
361
+ # Process input
362
+ inputs = processor.apply_chat_template(
363
+     messages,
364
+     tokenize=True,
365
+     return_dict=True,
366
+     return_tensors="pt",
367
+     add_generation_prompt=True,
368
+ ).to(model.device)
369
+ input_len = inputs["input_ids"].shape[-1]
370
 
371
+ # Generate output
372
+ outputs = model.generate(**inputs, max_new_tokens=512)
373
+ response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
374
 
375
+ # Parse output
376
+ processor.parse_response(response)
377
+ ```
378
 
379
+ </details>
380
 
 
381
 
382
+ ## **Best Practices**
383
 
384
+ For the best performance, use these configurations and best practices:
385
 
386
+ ### 1. Sampling Parameters
387
 
388
+ Use the following standardized sampling configuration across all use cases:
389
 
390
+ * `temperature=1.0`
391
+ * `top_p=0.95`
392
+ * `top_k=64`
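To show what these three settings do, the following sketch applies temperature scaling, top-k filtering, and top-p (nucleus) filtering to a toy set of logits in plain Python. It is a simplified illustration of the general technique, not the Transformers implementation; `sample_token` is a hypothetical helper and the logits are made up.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=64, top_p=0.95, rng=random):
    """Sample a token index after temperature scaling, top-k, then top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw one token, weighted by the surviving probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

toy_logits = [4.0, 3.0, 1.0, -2.0, -5.0]  # made-up scores for a 5-token vocabulary
print(sample_token(toy_logits, rng=random.Random(0)))
```

With Transformers, the same values are typically passed to `model.generate` as `do_sample=True, temperature=1.0, top_p=0.95, top_k=64`.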
393
 
394
+ ### 2. Thinking Mode Configuration
395
 
396
+ Compared to Gemma 3, the models use standard `system`, `assistant`, and `user` roles. To properly manage the thinking process, use the following control tokens:
397
 
398
+ * **Trigger Thinking:** Thinking is enabled by including the `<|think|>` token at the start of the system prompt. To disable thinking, remove the token.
399
+ * **Standard Generation:** When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
400
+ `<|channel>thought\n`**[Internal reasoning]**`<channel|>`
401
+ * **Disabled Thinking Behavior:** For all models except for the E2B and E4B variants, if thinking is disabled, the model will still generate the tags but with an empty thought block:
402
+ `<|channel>thought\n<channel|>`**[Final answer]**
403
 
404
+ > [!Note]
405
+ > Many libraries, such as Transformers and llama.cpp, handle the complexities of the chat template for you.
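If you handle raw output yourself rather than relying on a library, the structure described above can be split with a small helper. This is a sketch that assumes the thought block appears at most once and exactly in the `<|channel>thought\n` ... `<channel|>` form shown above; `split_thinking` is a hypothetical name, not a library function.

```python
import re

# Pattern for the thought block as described above (assumed to appear at most once).
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)

def split_thinking(raw_output):
    """Return (thinking, answer); thinking is '' when the block is empty or absent."""
    match = THOUGHT_BLOCK.search(raw_output)
    if match is None:
        return "", raw_output.strip()
    thinking = match.group(1).strip()
    answer = (raw_output[:match.start()] + raw_output[match.end():]).strip()
    return thinking, answer

print(split_thinking("<|channel>thought\nAdd 2 and 2.<channel|>The answer is 4."))
print(split_thinking("<|channel>thought\n<channel|>Hello!"))
```

The second call covers the disabled-thinking case, where the tags are present but the thought block is empty.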
406
 
407
+ ### 3. Multi-Turn Conversations
408
 
409
+ * **No Thinking Content in History**: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must *not be added* before the next user turn begins.
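One way to follow this rule when managing the message list by hand is to strip the thought block before storing each model turn. The sketch below assumes the `<|channel>thought\n` ... `<channel|>` structure from the previous section; `append_assistant_turn` is a hypothetical helper.

```python
import re

# Thought block as described in the thinking-mode section (an assumption about raw output).
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)

def append_assistant_turn(history, raw_output):
    """Append the model's reply to the history with any thought block removed."""
    final_answer = THOUGHT_BLOCK.sub("", raw_output).strip()
    history.append({"role": "assistant", "content": final_answer})
    return history

history = [{"role": "user", "content": "What is 2 + 2?"}]
append_assistant_turn(history, "<|channel>thought\nSimple arithmetic.<channel|>4")
history.append({"role": "user", "content": "And doubled?"})
print(history)
```

Only the final answer of the earlier turn is carried into the next request.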
410
 
411
+ ### 4. Modality Order
412
 
413
+ * For optimal performance with multimodal inputs, place image and/or audio content **before** the text in your prompt.
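A small helper can enforce this ordering on a message's content list. This is a sketch: `media_first` is a hypothetical name, the URL is a placeholder, and treating `video` like image and audio is an assumption based on the video snippet earlier in this card.

```python
MEDIA_TYPES = {"image", "audio", "video"}

def media_first(content):
    """Return the content parts with media entries ahead of text, preserving relative order."""
    media = [part for part in content if part.get("type") in MEDIA_TYPES]
    other = [part for part in content if part.get("type") not in MEDIA_TYPES]
    return media + other

content = [
    {"type": "text", "text": "What is shown in this image?"},
    {"type": "image", "url": "https://example.com/photo.png"},  # placeholder URL
]
print(media_first(content))
```

For deliberately interleaved prompts, where the position of each image carries meaning, reordering like this is probably not what you want.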
414
 
415
+ ### 5. Variable Image Resolution
416
 
417
+ Aside from variable aspect ratios, Gemma 4 supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.
418
 
419
+ * The supported token budgets are: **70**, **140**, **280**, **560**, and **1120**.
420
+ * Use *lower budgets* for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
421
+ * Use *higher budgets* for tasks like OCR, document parsing, or reading small text.
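Because only five budget values are supported, it can help to snap an arbitrary request to the nearest valid one. The helper below is a hypothetical sketch; how the chosen budget is passed to the processor is not covered in this card, so that step is left out.

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)

def pick_token_budget(requested):
    """Snap a requested visual token budget to the nearest supported value
    (ties go to the lower budget)."""
    return min(SUPPORTED_BUDGETS, key=lambda b: (abs(b - requested), b))

print(pick_token_budget(100))
print(pick_token_budget(1000))
```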
422
 
423
+ ### 6. Audio
424
 
425
+ Use the following prompt structures for audio processing:
426
 
427
+ * **Automatic Speech Recognition (ASR)**
428
+
429
+ ```text
430
+ Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.
431
+
432
+ Follow these specific instructions for formatting the answer:
433
+ * Only output the transcription, with no newlines.
434
+ * When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
435
+ ```
436
+
437
+ * **Automatic Speech Translation (AST)**
438
+
439
+ ```text
440
+ Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
441
+ When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
442
+ ```
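The two templates above can be filled in programmatically. The sketch below reproduces them as Python format strings; the function names and the lowercase placeholder names are illustrative assumptions.

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, "
    "and write 3 instead of three."
)

AST_TEMPLATE = (
    "Transcribe the following speech segment in {source}, then translate it into {target}.\n"
    "When formatting the answer, first output the transcription in {source}, then one newline, "
    "then output the string '{target}: ', then the translation in {target}."
)

def asr_prompt(language):
    """Build the speech recognition prompt for one language."""
    return ASR_TEMPLATE.format(language=language)

def ast_prompt(source, target):
    """Build the speech translation prompt for a source/target language pair."""
    return AST_TEMPLATE.format(source=source, target=target)

print(ast_prompt("Spanish", "English"))
```

The resulting string goes in the `text` part of the user message, after the audio part.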
443
+
444
+ ### 7. Audio and Video Length
445
+
446
+ All models accept image inputs and can process video as a sequence of frames; the E2B and E4B models also accept audio. Audio clips can be at most 30 seconds long. Video can be at most 60 seconds long, assuming frames are sampled at one frame per second.
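These limits can be checked before sending input to the model. The sketch below samples one frame per second up to the 60-second limit and splits longer audio into windows of at most 30 seconds. Both helper names are hypothetical, and chunking long audio is a suggested workaround rather than something this card prescribes.

```python
MAX_AUDIO_SECONDS = 30  # audio limit stated above
MAX_VIDEO_SECONDS = 60  # video limit stated above, at one frame per second

def frame_timestamps(video_seconds):
    """Timestamps (in whole seconds) of the frames to sample, one per second,
    capped at the 60-second limit."""
    usable = min(int(video_seconds), MAX_VIDEO_SECONDS)
    return list(range(usable))

def audio_chunks(audio_seconds):
    """Split a duration into (start, end) windows no longer than 30 seconds."""
    chunks, start = [], 0.0
    while start < audio_seconds:
        end = min(start + MAX_AUDIO_SECONDS, audio_seconds)
        chunks.append((start, end))
        start = end
    return chunks

print(len(frame_timestamps(95)))
print(audio_chunks(70))
```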
447
+
448
+ ## **Model Data**
449
+
450
+ Data used for model training and how the data was processed.
451
+
452
+ ### **Training Dataset**
453
+
454
+ Our pre-training dataset is a large-scale, diverse collection of data spanning a wide range of domains and modalities, including web documents, code, mathematics, images, and audio, with a cutoff date of January 2025. Here are the key components:
455
+
456
+ * **Web Documents**: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
457
+ * **Code**: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
458
+ * **Mathematics**: Training on mathematical text helps the model learn logical reasoning and symbolic representation, and address mathematical queries.
459
+ * **Images**: A wide range of images enables the model to perform image analysis and visual data extraction tasks.
460
+
461
+ The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.
462
+
463
+ ### **Data Preprocessing**
464
+
465
+ Here are the key data cleaning and filtering methods applied to the training data:
466
+
467
+ * **CSAM Filtering**: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
468
+ * **Sensitive Data Filtering**: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
469
+ * **Additional methods**: Filtering based on content quality and safety in line with [our policies](https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf).
470
+
471
+ ## **Ethics and Safety**
472
+
473
+ As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, Gemma 4 undergoes the same rigorous safety evaluations as our proprietary Gemini models.
474
+
475
+ ### **Evaluation Approach**
476
+
477
+ Gemma 4 models were developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with [Google’s AI principles](https://ai.google/principles/), as well as safety policies, which aim to prevent our generative AI models from generating harmful content, including:
478
+
479
+ * Content related to child sexual abuse material and exploitation
480
+ * Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm)
481
+ * Sexually explicit content
482
+ * Hate speech (e.g., dehumanizing members of protected groups)
483
+ * Harassment (e.g., encouraging violence against people)
484
+
485
+ ### **Evaluation Results**
486
+
487
+ Across all areas of safety testing, we saw major improvements in every content safety category relative to previous Gemma models. Overall, Gemma 4 models are significantly safer than Gemma 3 and 3n models while keeping unjustified refusals low. All testing was conducted without safety filters in order to evaluate the models' own capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the models produced minimal policy violations and showed significant improvements over previous Gemma models.
488
+
489
+ ## **Usage and Limitations**
490
+
491
+ These models have certain limitations that users should be aware of.
492
+
493
+ ### **Intended Usage**
494
+
495
+ Multimodal models (capable of processing vision, language, and/or audio) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
496
+
497
+ * **Content Creation and Communication**
498
+     * **Text Generation**: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
499
+     * **Chatbots and Conversational AI**: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
500
+     * **Text Summarization**: Generate concise summaries of a text corpus, research papers, or reports.
501
+     * **Image Data Extraction**: These models can be used to extract, interpret, and summarize visual data for text communications.
502
+     * **Audio Processing and Interaction**: The smaller models (E2B and E4B) can analyze and interpret audio inputs, enabling voice-driven interactions and transcriptions.
503
+ * **Research and Education**
504
+     * **Natural Language Processing (NLP) and VLM Research**: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
505
+     * **Language Learning Tools**: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
506
+     * **Knowledge Exploration**: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
507
+
508
+ ### **Limitations**
509
+
510
+ * **Training Data**
511
+     * The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
512
+     * The scope of the training dataset determines the subject areas the model can handle effectively.
513
+ * **Context and Task Complexity**
514
+     * Models perform well on tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
515
+     * A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
516
+ * **Language Ambiguity and Nuance**
517
+     * Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
518
+ * **Factual Accuracy**
519
+     * Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
520
+ * **Common Sense**
521
+     * Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.
522
+
523
+ ### **Ethical Considerations and Risks**
524
+
525
+ The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:
526
+
527
+ * **Bias and Fairness**
528
+     * VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. Gemma 4 models underwent careful scrutiny, input data pre-processing, and post-training evaluations as reported in this card to help mitigate the risk of these biases.
529
+ * **Misinformation and Misuse**
530
+     * VLMs can be misused to generate text that is false, misleading, or harmful.
531
+     * Guidelines for responsible use of the model are provided; see the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
532
+ * **Transparency and Accountability**
533
+     * This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
534
+     * A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.
535
+
536
+ **Risks identified and mitigations**:
537
+
538
+ * **Generation of harmful content**: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
539
+ * **Misuse for malicious purposes**: Technical limitations, together with developer and end-user education, can help mitigate malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided.
540
+ * **Privacy violations**: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
541
+ * **Perpetuation of biases**: Developers are encouraged to perform continuous monitoring (using evaluation metrics and human review) and to explore de-biasing techniques during model training, fine-tuning, and other use cases.
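
The monitoring recommended in the list above could be approached, for example, by tracking two rates over batches of reviewed responses: how often outputs violate a content policy, and how often benign prompts are refused. The following is a minimal sketch under stated assumptions, not part of Gemma 4 or any Google tooling. The `EvalRecord` schema, the function names, and the threshold values are all hypothetical and chosen only for illustration; the labels would come from an application's own automated classifiers or human reviewers.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One reviewed model response (hypothetical schema, not from the model card)."""
    prompt_is_benign: bool   # True if the prompt should have been answered
    violated_policy: bool    # True if the response broke a content policy
    refused: bool            # True if the model declined to answer


def safety_metrics(records):
    """Return the violation rate and unjustified-refusal rate for a batch."""
    if not records:
        raise ValueError("records must not be empty")
    violations = sum(r.violated_policy for r in records)
    benign = [r for r in records if r.prompt_is_benign]
    # A refusal only counts as unjustified when the prompt was benign.
    unjustified = sum(r.refused for r in benign)
    return {
        "violation_rate": violations / len(records),
        "unjustified_refusal_rate": unjustified / len(benign) if benign else 0.0,
    }


def needs_human_review(metrics, max_violation_rate=0.01, max_refusal_rate=0.05):
    """Flag a batch when either rate exceeds its threshold (assumed values)."""
    return (
        metrics["violation_rate"] > max_violation_rate
        or metrics["unjustified_refusal_rate"] > max_refusal_rate
    )
```

Tracking both rates together reflects the trade-off described under Evaluation Results: a safeguard that lowers violations by refusing benign requests would show up as a rising unjustified-refusal rate. Appropriate thresholds depend on each product's policies and use case.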
542
+
543
+ ### **Benefits**
544
+
545
+ At the time of release, this family of models provides high-performance open vision-language model implementations that, compared to similarly sized models, are designed from the ground up for responsible AI development.