Libraries

How to use morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M with llama-cpp-python:

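# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF from the Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M",
    filename="gemma-4-31B-it-L25L26x1.5-IQ1_M.gguf",
)

# Run one chat turn; the result is an OpenAI-style completion dict.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)
print(response["choices"][0]["message"]["content"])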

Local Apps

llama.cpp

How to use morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M
# Run inference directly in the terminal:
llama-cli -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M
# Run inference directly in the terminal:
llama-cli -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M
# Run inference directly in the terminal:
./llama-cli -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

Use Docker

docker model run hf.co/morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

vLLM

How to use morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
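
The same request can be sent from Python. The sketch below uses only the standard library and assumes an OpenAI-compatible server is already listening. `build_chat_request` and `send_chat_request` are illustrative helper names, not part of vLLM or llama.cpp. The default base URL matches the curl example above; the Pi configuration further down points at http://localhost:8080/v1 for llama-server instead.

```python
import json
import urllib.request

MODEL_ID = "morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M"


def build_chat_request(prompt, base_url="http://localhost:8000/v1", model=MODEL_ID):
    """Build a POST request for the /chat/completions endpoint (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(send_chat_request(build_chat_request("What is the capital of France?")))
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.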

Use Docker

docker model run hf.co/morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

Ollama
How to use morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M with Ollama:
```
ollama run hf.co/morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M
```

Unsloth Studio

How to use morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M to start chatting

Pi

How to use morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

Run Hermes

hermes

Docker Model Runner
How to use morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M with Docker Model Runner:
```
docker model run hf.co/morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M
```

Lemonade

How to use morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M:IQ1_M

Run and chat with the model

lemonade run user.gemma-4-31B-it-L25L26x1.5-IQ1_M-IQ1_M

List all available models

lemonade list

Gemma 4 31B-it IQ1_M + L25+L26 ×1.5 (8-byte F32 patch)

1-bit ULTRA-quantized Gemma 4 31B revived with an 8-byte F32 patch — GSM8k 24% → 60% (+36pt), HellaSwag +10.95pt, training-free.

A 1-bit-class (IQ1_M, about 1.75 bits per weight) quantization of Gemma 4 31B-it, only ~9.5 GB, with the minimal L25+L26 ×1.5 patch applied to two F32 layer_output_scale weights. Training-free, calibration-free, zero inference overhead.

TL;DR

Reading the GSM8k numbers: paper v1 reported +36pt at n=100 (ctx=1024). The paper-grade re-validation at n=500, ctx=16384 was run on Q2_K, where the gain split into +5.40pt pure capability plus +3.60pt token-budget convergence. IQ1_M was not re-validated at n=500, so the +36pt below mixes capability gain with token-budget efficiency at small n. Treat it as a paper v1 protocol number: directional rather than statistically confirmed.

metric	baseline IQ1_M	L25+L26 ×1.5 patched	Δ
GSM8k (n=100, ctx=1024, paper v1 legacy)	24.0% [16.69, 33.23]	60.0% [50.20, 69.06]	+36.0pt ⭐⭐ (CIs separated 16.99pt, n=500 not re-validated)
HellaSwag (n=10042 full)	42.02% [41.06, 42.99]	52.98% [52.00, 53.95]	+10.95pt (CIs separated)
Winogrande (n=1267 full)	49.80%	55.56%	+5.76pt
ARC-Challenge (n=1165)	30.56%	36.74%	+6.18pt
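
The bracketed intervals in the table are consistent with 95% Wilson score intervals, although this card does not name the method, so treat that as an inference. A minimal sketch that recomputes the two GSM8k rows from their counts (24/100 and 60/100), with `wilson_interval` as an illustrative helper name:

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the (lower, upper) Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


for label, correct in [("baseline", 24), ("patched", 60)]:
    lo, hi = wilson_interval(correct, 100)
    print(f"{label}: [{lo * 100:.2f}, {hi * 100:.2f}]")
```

With z = 1.96 this prints [16.69, 33.23] for the baseline and [50.20, 69.06] for the patched model, matching the GSM8k row.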

Striking result: GSM8k jumps from 24% to 60%, a +36 percentage point improvement and the largest single-cell gain in our 12-cell evaluation matrix. We read this as the strongest evidence so far that the F32 patch unlocks structural reasoning capacity rather than merely recovering quantization loss, subject to the n=100 caveat above.

Patch size: 8 bytes (2 layers × 4 bytes F32 scalar)
Model size: ~9.5 GB
Recommended for: CPU-only laptops, low-VRAM GPUs (≤8 GB)
Same patch used uniformly for Q1/Q2/Q4 — a single 8-byte recipe across all release quantizations

What is L25+L26 ×1.5?

l25_l26_patch = {
    25: 1.5,
    26: 1.5,
}

Two multiplicative scales on F32 layer_output_scale weights at layers 25 and 26. The simplest possible patch that consistently unlocks capacity across all three release quantizations.
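
As a rough illustration of what the recipe does to the weights, the sketch below applies the same two multipliers to a plain dictionary of per-layer scalars and counts the bytes that change. The layer count of 60 and the starting value of 1.0 are placeholders chosen for the example, not values read from the model, and `apply_scale_patch` is a hypothetical helper rather than the released apply_l25l26.py.

```python
import struct

L25_L26_PATCH = {25: 1.5, 26: 1.5}


def to_f32(value):
    """Round a Python float to the nearest IEEE-754 float32, as stored in an F32 tensor."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def apply_scale_patch(scales, patch):
    """Return a copy of `scales` with the patched layers multiplied by their factor."""
    patched = dict(scales)
    for layer, factor in patch.items():
        patched[layer] = to_f32(scales[layer] * factor)
    return patched


# Placeholder input: 60 layers, each with a scale of 1.0 (illustrative values only).
scales = {layer: 1.0 for layer in range(60)}
patched = apply_scale_patch(scales, L25_L26_PATCH)

# Count how many scalars changed and how many bytes that represents.
changed = [layer for layer in scales if patched[layer] != scales[layer]]
patch_bytes = len(changed) * struct.calcsize("<f")
print(changed, patch_bytes)  # [25, 26] 8
```

Only two 4-byte scalars differ from the input, which is where the 8-byte figure comes from.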

Mechanism (preliminary, revised): The patch scales the per-layer layer_output_scale, a single F32 scalar per transformer block that gates how much of that block's normalized output is written back to the residual stream. Multiplying this gate by 1.5 at layers 25 and 26 amplifies their residual contribution. Why L25 and L26 specifically respond remains open: structural analysis of the GGUF shows that both are sliding-window (not full-attention) layers in the 5:1 hybrid pattern, which contradicts our earlier "rare full-attention slack" framing. Cross-model checks (Gemma 4 +11pt HellaSwag, Qwen 3.6 +2.5pt, Phi-4 BF16 destructive Δ, Llama null) indicate that the effect is specific to hybrid architectures, but explaining which layers respond, and why, is left to future work.
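
To make the gating description concrete, here is a toy residual update that follows the wording above: the block output is multiplied by a scalar before being added to the residual stream. This is a simplified illustration with made-up vectors, not the actual Gemma 4 compute graph, and `residual_update` is a hypothetical name.

```python
def residual_update(residual, block_output, output_scale):
    """Add a block's output to the residual stream, gated by a per-layer scalar."""
    return [r + output_scale * b for r, b in zip(residual, block_output)]


residual = [1.0, -2.0, 0.5]        # made-up residual stream values
block_output = [0.25, 0.5, -0.25]  # made-up normalized block output

baseline = residual_update(residual, block_output, output_scale=1.0)
patched = residual_update(residual, block_output, output_scale=1.0 * 1.5)

# The block's contribution (update minus residual) grows by the patch factor.
contribution_ratio = [
    (p - r) / (b - r) for p, b, r in zip(patched, baseline, residual)
]
print(contribution_ratio)  # [1.5, 1.5, 1.5]
```

The residual itself is untouched; only the amount written by the gated block changes, which is why the patch adds no inference cost.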

Honest note: I tried the fancy way first

Before settling on L25+L26, I ran an in-house multi-specialist optimization engine (~7h on a single 32 GB consumer GPU) targeting Q4 HellaSwag. It found an 11-layer 44-byte patch called basin B. On Q1 GSM8k specifically, basin B was worse than baseline (-8 pt, dropping to 16%), while L25+L26 reached 60%. The simple 2-layer patch wins. The basin B values are kept in the paper appendix for transparency.

Files

gemma-4-31B-it-L25L26x1.5-IQ1_M.gguf (~9.5 GB)
- MD5: 0985dd00c5408169d77cd4c3c021fda6
gemma-4-31B-it-L25L26x1.5-IQ1_M.gguf.md5
apply_l25l26.py
README.md (this file)
LICENSE
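
After downloading, the checksum can be compared against the MD5 listed above. A minimal sketch using only the standard library; `file_md5` is an illustrative helper, and the path in the usage comment assumes the ./gemma download directory used in the next section.

```python
import hashlib

EXPECTED_MD5 = "0985dd00c5408169d77cd4c3c021fda6"  # from the Files list above


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (after downloading the model):
# path = "./gemma/gemma-4-31B-it-L25L26x1.5-IQ1_M.gguf"
# print("OK" if file_md5(path) == EXPECTED_MD5 else "checksum mismatch")
```

Reading in chunks keeps memory use flat even though the file is about 9.5 GB.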

How to use

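# Download the model files into ./gemma:
huggingface-cli download morphicode-jp/gemma-4-31B-it-L25L26x1.5-IQ1_M \
    --local-dir ./gemma
# Run with a 4096-token context. -ngl 99 offloads all layers to the GPU;
# lower it (or use -ngl 0) on CPU-only or low-VRAM machines.
./llama-cli -m ./gemma/gemma-4-31B-it-L25L26x1.5-IQ1_M.gguf -ngl 99 -c 4096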

Apply to your own GGUF

pip install gguf numpy
git clone https://github.com/morphicode-jp/f32-patch-gemma
python f32-patch-gemma/apply_l25l26.py /path/to/google_gemma-4-31b-it-IQ1_M.gguf

Running the script with --restore undoes the patch by restoring the .backup file that it creates automatically.
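
The backup-and-restore behaviour can be pictured as a simple copy before modification. The sketch below is an assumption about the general pattern, not the code in apply_l25l26.py: the placement of the `.backup` suffix and the helper names `patch_with_backup` and `restore_from_backup` are hypothetical, and the patch itself is an arbitrary callback.

```python
import shutil
from pathlib import Path


def backup_path_for(path):
    """Return the sibling backup path, e.g. model.gguf -> model.gguf.backup (assumed naming)."""
    path = Path(path)
    return path.with_name(path.name + ".backup")


def patch_with_backup(path, patch_fn):
    """Copy the file to its backup path once, then let `patch_fn` modify it in place."""
    backup = backup_path_for(path)
    if not backup.exists():  # never overwrite an existing pristine backup
        shutil.copy2(path, backup)
    patch_fn(Path(path))


def restore_from_backup(path):
    """Overwrite the file with its backup, undoing the patch."""
    backup = backup_path_for(path)
    if not backup.exists():
        raise FileNotFoundError(f"no backup found at {backup}")
    shutil.copy2(backup, path)
```

A full-file backup of this kind needs as much free disk space as the model itself, roughly another 9.5 GB for this quantization.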

Sister releases

morphicode-jp/gemma-4-31B-it-L25L26x1.5-Q2_K — 2-bit, ~12 GB, flagship low-spec
morphicode-jp/gemma-4-31B-it-L25L26x1.5-Q4_K_M — 4-bit, ~19 GB, mainstream quality (beats Q8_0 BF16 baseline on all 4 benchmarks)

All three use the identical 8-byte L25+L26 ×1.5 patch.

Citation

@misc{hirai2026f32patch,
  title  = {Why Some LLMs Have a Hidden Reasoning Knob:
            Rare Full-Attention Bottlenecks in Hybrid Architectures
            and an 8-byte Quantization Recovery},
  author = {Hirai, Akito},
  year   = {2026},
  doi    = {10.5281/zenodo.20362821},
  url    = {https://doi.org/10.5281/zenodo.20362821}
}

Limitations & Methodology Notes

Patch values are calibrated for Gemma 4 31B; other Gemma sizes (9B, 27B) were not tested.
Cross-family transfer is weak (Qwen 3.6 +2.5pt; Phi-4 / Llama / Mistral null).
Alignment was measured on Q2_K: L25+L26 retains 93.46% AdvBench refusals versus 97.31% for the baseline, a 3.85pt drop. Q1 alignment was not measured separately.
Scorer caveat: HellaSwag and Winogrande accuracies were measured with llama-perplexity --hellaswag mode, which scores systematically 0.2–2.5 pp lower than the lm-evaluation-harness standard (llama.cpp discussion #2321). Within-scorer baseline-vs-patched deltas remain valid.

Contact

X (Twitter): @morphicode_jp
GitHub: github.com/morphicode-jp
Zenodo (paper + code archive): doi.org/10.5281/zenodo.20362821

DMs open for research collaboration.

License: Apache 2.0 for the patch tooling. Gemma 4 base weights are licensed under Apache 2.0 (verified 2026-05-31; Gemma 4 was moved off the older Gemma Terms of Use). The patched-GGUF derivative notice is in LICENSE-WEIGHTS.

Format: GGUF
Model size: 31B params
Architecture: gemma4
Quantization: 1-bit