Gemma 4 31B-it IQ1_M + L25+L26 ×1.5 (8-byte F32 patch)
1-bit ULTRA-quantized Gemma 4 31B revived with an 8-byte F32 patch — GSM8k 24% → 60% (+36pt), HellaSwag +10.95pt, training-free.
A 1-bit (IQ1_M) quantized Gemma 4 31B-it model — only ~9.5 GB — with the minimal L25+L26 ×1.5 patch applied to 2 F32 layer_output_scale weights. Training-free, calibration-free, zero inference overhead.
TL;DR
Reading the GSM8k numbers: paper v1 reported +36pt at n=100 (ctx=1024). The paper-grade re-validation at n=500, ctx=16384 was run on Q2_K (+5.40pt pure capability plus +3.60pt token-budget convergence). IQ1_M was not re-validated at n=500, so the +36pt below mixes capability gain with token-budget efficiency at small n. Treat it as a paper v1 protocol number: directional rather than statistically confirmed.
| metric | baseline IQ1_M | L25+L26 ×1.5 patched | Δ |
|---|---|---|---|
| GSM8k (n=100, ctx=1024, paper v1 legacy) | 24.0% [16.69, 33.23] | 60.0% [50.20, 69.06] | +36.0pt ⭐⭐ (CIs separated 16.97pt, n=500 not re-validated) |
| HellaSwag (n=10042 full) | 42.02% [41.06, 42.99] | 52.98% [52.00, 53.95] | +10.95pt (CIs separated) |
| Winogrande (n=1267 full) | 49.80% | 55.56% | +5.76pt |
| ARC-Challenge (n=1165) | 30.56% | 36.74% | +6.18pt |
Striking result: GSM8k jumps from 24% to 60%, a +36 percentage point improvement and the largest single-cell gain in our 12-cell evaluation matrix. We read this as the strongest evidence so far that the F32 patch unlocks structural reasoning capacity rather than merely recovering quantization loss, subject to the small-n caveat above.
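The bracketed intervals in the table can be sanity-checked with a few lines of Python. The sketch below assumes they are 95% Wilson score intervals; the source does not name the method, but the GSM8k bounds are reproduced by this formula with z = 1.96. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# GSM8k at n=100: 24 correct (baseline) vs 60 correct (patched).
for label, correct in [("baseline", 24), ("patched", 60)]:
    lo, hi = wilson_interval(correct, 100)
    print(f"{label}: {correct}% [{100 * lo:.2f}, {100 * hi:.2f}]")

# Gap between the baseline upper bound and the patched lower bound, in points.
gap = 100 * (wilson_interval(60, 100)[0] - wilson_interval(24, 100)[1])
print(f"CI gap: {gap:.2f}pt")
```

At n=100 each interval is roughly ±9 points wide, which is why the card treats the GSM8k result as directional even though the two intervals do not overlap.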
- Patch size: 8 bytes (2 layers × 4 bytes F32 scalar)
- Model size: ~9.5 GB
- Recommended for: CPU-only laptops, low-VRAM GPUs (≤8 GB)
- Same patch used uniformly for Q1/Q2/Q4 — a single 8-byte recipe across all release quantizations
What is L25+L26 ×1.5?
l25_l26_patch = {
25: 1.5,
26: 1.5,
}
Two multiplicative scales on F32 layer_output_scale weights at layers 25 and 26. The simplest possible patch that consistently unlocks capacity across all three release quantizations.
Mechanism (preliminary, revised): The patch scales the per-layer layer_output_scale, a single F32 scalar per transformer block that gates how much of that block's normalized output is written back to the residual stream. We multiply this gate by 1.5× at layers 25 and 26, amplifying their residual contribution. Why L25 and L26 specifically respond remains open: structural analysis of the GGUF shows both are sliding-window (not full-attention) layers in the 5:1 hybrid pattern, contradicting our earlier "rare full-attention slack" framing. Cross-model checks (Gemma 4 +11pt HellaSwag, Qwen 3.6 +2.5pt, Phi-4 BF16 destructive Δ, Llama null) indicate the effect is hybrid-architecture-specific, but explaining which layers respond and why is future work.
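As a rough illustration of the description above, here is a toy sketch of what multiplying two per-layer scalars does to the residual update. Everything besides the `{25: 1.5, 26: 1.5}` recipe is an assumption for illustration: the 32-block layer count, the starting scale of 1.0, and the simplified update rule `residual + scale * block_output` are not taken from the Gemma 4 implementation.

```python
# Patch recipe from this README: layer index -> multiplicative factor.
L25_L26_PATCH = {25: 1.5, 26: 1.5}

# Hypothetical toy model: 32 blocks, every output scale starts at 1.0.
TOY_NUM_LAYERS = 32
toy_scales = [1.0] * TOY_NUM_LAYERS


def patch_scales(scales, patch):
    """Return a copy of the per-layer scales with the patch factors applied."""
    patched = list(scales)
    for layer, factor in patch.items():
        patched[layer] *= factor
    return patched


def residual_update(residual, block_output, scale):
    """Simplified residual write: the scalar gates the block's contribution."""
    return [r + scale * b for r, b in zip(residual, block_output)]


patched_scales = patch_scales(toy_scales, L25_L26_PATCH)

# Same block output, written back with the original and the patched gate.
before = residual_update([1.0, 2.0], [0.5, -1.0], toy_scales[25])
after = residual_update([1.0, 2.0], [0.5, -1.0], patched_scales[25])
print(before, after)
```

Only the two patched layers change; every other block writes to the residual stream exactly as before, which is consistent with the patch adding no inference overhead.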
Honest note: I tried the fancy way first
Before settling on L25+L26, I ran an in-house multi-specialist optimization engine (~7h on a single 32 GB consumer GPU) targeting Q4 HellaSwag. It found an 11-layer 44-byte patch called basin B. On Q1 GSM8k specifically, basin B was worse than baseline (-8 pt, dropping to 16%), while L25+L26 reached 60%. The simple 2-layer patch wins. The basin B values are kept in the paper appendix for transparency.
Files
- gemma-4-31B-it-L25L26x1.5-IQ1_M.gguf (~9.5 GB)
  - MD5: 0985dd00c5408169d77cd4c3c021fda6
- gemma-4-31B-it-L25L26x1.5-IQ1_M.gguf.md5
- apply_l25l26.py
- README.md (this file)
- LICENSE
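To check a download against the MD5 listed above, a minimal sketch using only the Python standard library could look like this. The helper name `md5_of_file` and the 1 MiB chunk size are illustrative choices; the local path in the comment assumes the `--local-dir ./gemma` layout used in this README.

```python
import hashlib

# MD5 published in the file list of this README.
EXPECTED_MD5 = "0985dd00c5408169d77cd4c3c021fda6"


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a ~9.5 GB GGUF never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example usage (path is an assumption about where the file was saved):
#   ok = md5_of_file("./gemma/gemma-4-31B-it-L25L26x1.5-IQ1_M.gguf") == EXPECTED_MD5
```

MD5 is adequate for catching truncated or corrupted downloads; it is not a defense against deliberate tampering.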
How to use
huggingface-cli download morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M \
--local-dir ./gemma
./llama-cli -m ./gemma/gemma-4-31B-it-L25L26x1.5-IQ1_M.gguf -ngl 99 -c 4096
Apply to your own GGUF
pip install gguf numpy
git clone https://github.com/morphicode-jp/f32-patch-gemma
python f32-patch-gemma/apply_l25l26.py /path/to/google_gemma-4-31b-it-IQ1_M.gguf
Passing --restore undoes the patch using the auto-created .backup file.
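The backup-then-patch pattern described here can be sketched on a toy binary file. This is not the code of apply_l25l26.py: the real script presumably locates the layer_output_scale tensors through the gguf package, whereas the byte offsets, the function names, and the exact `.backup` naming below are assumptions made for illustration.

```python
import shutil
import struct
from pathlib import Path


def backup_path(path):
    """Assumed naming convention: <file>.backup next to the original."""
    path = Path(path)
    return path.with_name(path.name + ".backup")


def patch_f32_scalars(path, offsets, factor=1.5):
    """Multiply little-endian float32 values at the given byte offsets in place."""
    backup = backup_path(path)
    if not backup.exists():
        shutil.copy2(path, backup)  # keep a pristine copy before touching the file
    with open(path, "r+b") as f:
        for offset in offsets:
            f.seek(offset)
            (value,) = struct.unpack("<f", f.read(4))
            f.seek(offset)
            f.write(struct.pack("<f", value * factor))
    return backup


def restore(path):
    """Undo the patch by copying the backup over the patched file."""
    shutil.copy2(backup_path(path), path)
```

Two offsets times four bytes per float32 is the 8 bytes the README refers to; everything else in the file is left untouched.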
Sister releases
- morphicode-jp/gemma-4-31B-it-L25L26x1.5-Q2_K: 2-bit, ~12 GB, flagship low-spec
- morphicode-jp/gemma-4-31B-it-L25L26x1.5-Q4_K_M: 4-bit, ~19 GB, mainstream quality (beats Q8_0 BF16 baseline on all 4 benchmarks)
All three use the identical 8-byte L25+L26 ×1.5 patch.
Citation
@misc{hirai2026f32patch,
title = {Why Some LLMs Have a Hidden Reasoning Knob:
Rare Full-Attention Bottlenecks in Hybrid Architectures
and an 8-byte Quantization Recovery},
author = {Hirai, Akito},
year = {2026},
doi = {10.5281/zenodo.20362821},
url = {https://doi.org/10.5281/zenodo.20362821}
}
Limitations & Methodology Notes
- Patch values are calibrated for Gemma 4 31B; other Gemma sizes (9B, 27B) not tested.
- Cross-family transfer is weak (Qwen 3.6 +2.5pt; Phi-4 / Llama / Mistral null).
- Alignment was measured on Q2_K (L25+L26: 93.46% AdvBench refusal retention vs baseline 97.31%, a 3.85pt drop). Q1 alignment not separately measured.
- Scorer caveat: HellaSwag/Winogrande accuracies were measured with llama-perplexity --hellaswag mode, which scores systematically 0.2–2.5 pp lower than the lm-evaluation-harness standard (llama.cpp discussion #2321). Within-scorer baseline-vs-patched deltas remain valid.
Contact
- X (Twitter): @morphicode_jp
- GitHub: github.com/morphicode-jp
- Zenodo (paper + code archive): doi.org/10.5281/zenodo.20362821
DMs open for research collaboration.
License: Apache 2.0 for the patch tooling. Gemma 4 base weights are licensed under Apache 2.0 (verified 2026-05-31; Gemma 4 was moved off the older Gemma Terms of Use). The patched-GGUF derivative notice is in LICENSE-WEIGHTS.