Instructions to use google/gemma-4-12B-it-qat-w4a16-ct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-4-12B-it-qat-w4a16-ct with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForMultimodalLM processor = AutoProcessor.from_pretrained("google/gemma-4-12B-it-qat-w4a16-ct") model = AutoModelForMultimodalLM.from_pretrained("google/gemma-4-12B-it-qat-w4a16-ct") - Notebooks
- Google Colab
- Kaggle
Tested it and its the Bomb! Vram efficient with high Quality.
"TIGHT! TIGHT, TIGHT! YEAH! Oh, Q4_0,Q3_0, Q1, whatever, man! Just keep bringing me that!"
~Thank You Google engineer who didn't sleep for days or weeks to cook this magnificent Quant aware, I can finally fit it into my bot and bird watcher and process data and image super fast.
vllm command ?
Hi @JanjanJean , Thank you for the feedback!
@mohamedemam , If you're looking to run it with vLLM, try this command:
vllm serve google/gemma-4-12B-it-qat-w4a16-ct \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
Refer to this documentation for more details. Thanks!
i tried it recently but weird bug appear 'Gemma4UnifiedVisionConfig' object has no attribute 'num_soft_tokens'
which version do you use?
yeah.... I hit the num_soft_tokens problem as well.... blaaaaa:
WARNING 06-11 02:47:09 [interface.py:757] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(APIServer pid=1) INFO 06-11 02:47:09 [api_utils.py:339]
(APIServer pid=1) INFO 06-11 02:47:09 [api_utils.py:339] █ █ █▄ ▄█
(APIServer pid=1) INFO 06-11 02:47:09 [api_utils.py:339] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.22.1rc1.dev332+g2c9c07c85
(APIServer pid=1) INFO 06-11 02:47:09 [api_utils.py:339] █▄█▀ █ █ █ █ model google/gemma-4-12B-it-qat-w4a16-ct
(APIServer pid=1) INFO 06-11 02:47:09 [api_utils.py:339] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 06-11 02:47:09 [api_utils.py:339]
(APIServer pid=1) INFO 06-11 02:47:09 [api_utils.py:273] non-default args: {'model_tag': 'google/gemma-4-12B-it-qat-w4a16-ct', 'model': 'google/gemma-4-12B-it-qat-w4a16-ct', 'trust_remote_code': True, 'max_model_len': 64000, 'quantization': 'compressed-tensors', 'served_model_name': ['gem4'], 'max_num_batched_tokens': 4096}
(APIServer pid=1) WARNING 06-11 02:47:09 [envs.py:2111] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) WARNING 06-11 02:47:09 [envs.py:2111] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 06-11 02:47:09 [envs.py:2111] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 06-11 02:47:09 [envs.py:2111] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(APIServer pid=1) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(APIServer pid=1) INFO 06-11 02:47:18 [model.py:611] Resolved architecture: Gemma4UnifiedForConditionalGeneration
(APIServer pid=1) INFO 06-11 02:47:18 [model.py:1745] Using max model len 64000
(APIServer pid=1) INFO 06-11 02:47:19 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) INFO 06-11 02:47:19 [config.py:99] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). FA4 not available, forcing TRITON_ATTN backend.
(APIServer pid=1) INFO 06-11 02:47:19 [vllm.py:999] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 06-11 02:47:19 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=1) WARNING 06-11 02:47:19 [cuda.py:243] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 95, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 148, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 665, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 679, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 135, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 135, in init
(APIServer pid=1) self.input_processor = InputProcessor(self.vllm_config, renderer)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/input_processor.py", line 62, in init
(APIServer pid=1) mm_budget = MultiModalBudget(vllm_config, mm_registry)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/encoder_budget.py", line 87, in init
(APIServer pid=1) all_mm_max_toks_per_item = get_mm_max_toks_per_item(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/encoder_budget.py", line 25, in get_mm_max_toks_per_item
(APIServer pid=1) max_tokens_per_item = processor.info.get_mm_max_tokens_per_item(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_unified.py", line 166, in get_mm_max_tokens_per_item
(APIServer pid=1) tokens_per_image = config.vision_config.num_soft_tokens
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 434, in getattribute
(APIServer pid=1) return super().getattribute(key)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) AttributeError: 'Gemma4UnifiedVisionConfig' object has no attribute 'num_soft_tokens'
same, no luck running it with vllm
Hi guys, can you check it now? Please let us know if this fixes your scenario.
Ref : https://huggingface.co/google/gemma-4-12B-it-qat-w4a16-ct/discussions/5
I still can't get it to run with latest vllm nightly cuda 12.9 build:
RuntimeError: Shape mismatch: a.size(1) = 4096, size_k = 8192