GGUF missing nextn tensors: embed_tokens and shared_head_head (llama.cpp fork fails to load)

#2 opened by miniwithmama

Issue

The GGUF files (e.g., EXAONE-4.5-33B-Q4_K_M.gguf) cannot be loaded by the referenced llama.cpp fork (nuxlear/llama.cpp@add-exaone4_5) due to missing nextn tensors.

Error

model has unused tensor blk.64.ffn_gate.weight -- ignoring
model has unused tensor blk.64.ffn_down.weight -- ignoring
model has unused tensor blk.64.ffn_up.weight -- ignoring
model has unused tensor blk.64.post_ffw_norm.weight -- ignoring
llama_model_load: error loading model: missing tensor 'blk.%d.nextn.eh_proj'

Root Cause

The GGUF contains 4 nextn tensors in blk.64:

  • blk.64.nextn.eh_proj.weight ✅
  • blk.64.nextn.enorm.weight ✅
  • blk.64.nextn.hnorm.weight ✅
  • blk.64.nextn.shared_head_norm.weight ✅

But the forked llama.cpp code (src/llama-arch.cpp, lines 437-442) expects 6 nextn tensors:

  • blk.%d.nextn.eh_proj ✅
  • blk.%d.nextn.embed_tokens ❌ MISSING
  • blk.%d.nextn.enorm ✅
  • blk.%d.nextn.hnorm ✅
  • blk.%d.nextn.shared_head_head ❌ MISSING
  • blk.%d.nextn.shared_head_norm ✅

This causes a tensor count mismatch: the loader expects 723 tensors but the GGUF provides 719. The 4 missing entries could be the 2 absent tensor types counted across 2 nextn layers, or simply 4 individual tensor entries.
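The present/absent split above can be double-checked directly. A minimal sketch using the `gguf` Python package that ships with llama.cpp (`pip install gguf`); the model path is the file from this report, and the script only reads tensor metadata, not the weights:

```python
# Sketch: confirm which nextn tensors a GGUF file actually contains.
# Assumes the `gguf` Python package from llama.cpp (pip install gguf).
import os

MODEL_PATH = "EXAONE-4.5-33B-Q4_K_M.gguf"


def nextn_tensor_names(names):
    """Return the sorted subset of tensor names belonging to a nextn block."""
    return sorted(n for n in names if ".nextn." in n)


if __name__ == "__main__" and os.path.exists(MODEL_PATH):
    from gguf import GGUFReader  # reads GGUF metadata and tensor infos

    reader = GGUFReader(MODEL_PATH)
    for name in nextn_tensor_names(t.name for t in reader.tensors):
        print(name)  # per this report: 4 names, not the 6 the loader wants
```

Running this against the Q4_K_M file should print only the 4 tensors listed above, confirming the gap on the conversion side rather than in the quantization step.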

Environment

  • Hardware: NVIDIA DGX Spark GB10 (Blackwell SM121, 128GB)
  • llama.cpp fork: nuxlear/llama.cpp@add-exaone4_5 (commit 3b12fcd1)
  • GGUF: EXAONE-4.5-33B-Q4_K_M.gguf from this repo
  • mmproj: mmproj-EXAONE-4.5-33B-BF16.gguf included
  • Also tested with upstream llama.cpp (latest); same error

Steps to Reproduce

git clone -b add-exaone4_5 https://github.com/nuxlear/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j$(nproc) --target llama-server

./build/bin/llama-server \
  -m EXAONE-4.5-33B-Q4_K_M.gguf \
  -mm mmproj-EXAONE-4.5-33B-BF16.gguf \
  -ngl 999 -c 8192 --port 8000 -a EXAONE-4.5-33B --jinja

Expected

Model loads successfully and serves via OpenAI-compatible API.

Actual

Model fails to load with missing tensor error.

Suggestion

Either:

  1. Include the missing embed_tokens and shared_head_head tensors in the GGUF conversion
  2. Update the forked llama.cpp to treat these tensors as optional
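Suggestion 2 amounts to an optional-tensor lookup at load time: required tensors still fail loudly, optional ones fall back to None. A sketch of that logic in Python; `get_tensor` is a hypothetical stand-in for the loader's lookup, not llama.cpp's actual API (upstream llama.cpp expresses the same idea with a not-required flag on its tensor-creation call):

```python
# Sketch of suggestion 2: tolerate the two absent nextn tensors.
# `get_tensor` is a hypothetical stand-in for the loader's lookup.
def get_tensor(tensors, name, required=True):
    """Look up a tensor by name; missing required tensors raise."""
    t = tensors.get(name)
    if t is None and required:
        raise ValueError(f"missing tensor '{name}'")
    return t


# Simulate the GGUF from this report: eh_proj present, embed_tokens absent.
loaded = {"blk.64.nextn.eh_proj.weight": object()}

eh_proj = get_tensor(loaded, "blk.64.nextn.eh_proj.weight")
embed_tokens = get_tensor(loaded, "blk.64.nextn.embed_tokens.weight",
                          required=False)
assert embed_tokens is None  # loading continues instead of aborting
```

When an optional nextn tensor comes back empty, the fork could presumably fall back to the model's main embedding and output head, which would also explain why the conversion omitted these large duplicate tensors in the first place.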
LG AI Research org

Hello, @miniwithmama. Thank you for your contribution!

We found that some nextn tensors were missing in the architecture definition.
We added a hotfix commit to our fork and confirmed that the example code works properly.

Could you please try again after updating the repository?

