Insight into the "weird" data.

#3
by espen96 - opened

So you mentioned that the data was weird and not well behaved? Would you mind elaborating on what happened?

This model is quite unusual, being a dense SSM/transformer hybrid. While everyone is doing a ton of research into the 35B model, I am really curious what we might learn about this special one.
Especially as we might expect even more of them when the smaller 3.5 models release.

A couple of quants have PPL below baseline; the Q8_0 has PPL slightly below BF16. There are multiple KLD / cross-correlation / delta-P statistics which are not always correlated with the PPL, etc.

Often on a big MoE the PPL is more correlated with KLD, and quality decreases more or less monotonically with size.

I've analyzed quants mostly in the 4~5 bpw range and they are all somewhat comparable in general, though some appear slightly better. But the lowest-PPL quants are not the lowest-KLD ones, etc.
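For anyone following along, the KLD stat here is the per-token KL divergence between the baseline and quantized output distributions, averaged over the corpus. A minimal sketch of the per-token math, with function names of my own choosing rather than llama.cpp's:

```python
import numpy as np

def token_kld(logits_base: np.ndarray, logits_quant: np.ndarray) -> float:
    """Per-token KL divergence D_KL(P_base || P_quant) from raw logits.

    llama-perplexity's "Mean KLD" is this value averaged over every
    evaluated token; this sketch only shows the per-token computation.
    """
    # softmax with max-subtraction for numerical stability
    p = np.exp(logits_base - logits_base.max()); p /= p.sum()
    q = np.exp(logits_quant - logits_quant.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

base = np.array([4.0, 2.0, 1.0, -1.0])
# identical logits -> KLD of 0; any perturbation -> strictly positive
assert abs(token_kld(base, base)) < 1e-12
assert token_kld(base, base * 1.5) > 0.0
```

The PPL/KLD disagreement falls out of this: PPL only looks at the probability assigned to the one correct token, while KLD compares the whole distribution, so a quant can nudge the correct token up (lowering PPL) while distorting the rest of the distribution (raising KLD).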

I may post some logs or data after digging a bit more, but for now I just released the smol-IQ4_NL, as some folks have interest in that type of quantization, e.g. @Lockout and maybe @tarruda

hm, interesting. Your quant so far seems to be behaving as expected at least.
On my pop culture "token flip / confabulation" prompts they do not seem to be much worse for wear than even Q6.
I might note that while the 122B and 27B appear similar in many benchmarks, the 122B model is significantly better at dealing with medium- and long-tail knowledge.
It soundly beats both the smaller models, with and without reasoning.

It was also way better at writing GLSL shaders. The 35B model at Q8_0 was soundly beaten by the 122B model at IQ4_XS, and the 27B was not as good but much closer to the 122B than the 35B, while itself being UD Q4_K_XL at the time of the test.

My benchmark is sadly not yet ready for a proper test, as I'm having issues with generating the datasets. But I will certainly check whether the probability distributions towards the factual answers differ in any significant way.
Measuring quants might be a bit beyond it without a full and diverse test set, however...

All in all... I am glad we are starting to see some new, fresh data. All the old conventional wisdom is built on pure-transformer dense models; it's about time we are seeing some public-facing data on these new architectures, eh?

Hey @urbergarm !
I don't have the resources to get the .kld for the BF16. Do you plan to release the KLD values for Q8_0 and your smol-IQ4_NL to compare? I was waiting for you to jump in, expecting more numbers :)

I also notice that somehow PPL is better without an imatrix. And even a 4.165 bpw quant that can fit into 16 GB VRAM with good context can achieve a PPL of 6.8931 +/- 0.04448, and actually work well in real tasks.
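As a rough sanity check on the 16 GB claim: weight size is just parameter count × bits-per-weight / 8, ignoring the KV cache and activation buffers. A sketch, where the round 27e9 parameter count is my assumption:

```python
def gguf_size_gib(n_params: float, bpw: float) -> float:
    """Approximate weight size (GiB) for a quant at a given bits-per-weight.

    Ignores KV cache, activations, and per-tensor overhead, so it is a
    lower bound on actual VRAM use.
    """
    return n_params * bpw / 8 / 2**30

# a ~27B model at 4.165 bpw: weights alone land around 13 GiB, leaving
# some headroom for context on a 16 GiB card
size = gguf_size_gib(27e9, 4.165)
assert 12.5 < size < 13.5
```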

Owner

@owao

I've been out over the weekend and also waiting to see how things fall out after the unsloth qwen3 shakeup.

Here is some raw data from KLD stats on this model for various quants from last week (before the UD shakeup, assuming it affected the dense model). Many of these are just test quants I never released.

$ grep -E '(Cor|Mean.*KLD|Max.*KLD|99.0.*Δp)' kld*.log
kld-Qwen3.5-27B-bartowski-Q5_K_M.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.85%
kld-Qwen3.5-27B-bartowski-Q5_K_M.log:Mean    KLD:   0.005931 ±   0.002823
kld-Qwen3.5-27B-bartowski-Q5_K_M.log:Maximum KLD:  23.893120
kld-Qwen3.5-27B-bartowski-Q5_K_M.log:99.0%   Δp:  3.394%
kld-Qwen3.5-27B-derp-Q4_0.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.75%
kld-Qwen3.5-27B-derp-Q4_0.log:Mean    KLD:   0.009037 ±   0.001811
kld-Qwen3.5-27B-derp-Q4_0.log:Maximum KLD:  18.391321
kld-Qwen3.5-27B-derp-Q4_0.log:99.0%   Δp:  6.682%
kld-Qwen3.5-27B-IQ5_K.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.84%
kld-Qwen3.5-27B-IQ5_K.log:Mean    KLD:   0.006820 ±   0.002943
kld-Qwen3.5-27B-IQ5_K.log:Maximum KLD:  25.351452
kld-Qwen3.5-27B-IQ5_K.log:99.0%   Δp:  3.833%
kld-Qwen3.5-27B-Q4_0.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.77%
kld-Qwen3.5-27B-Q4_0.log:Mean    KLD:   0.007423 ±   0.001620
kld-Qwen3.5-27B-Q4_0.log:Maximum KLD:  16.334347
kld-Qwen3.5-27B-Q4_0.log:99.0%   Δp:  5.843%
kld-Qwen3.5-27B-Q8_0.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.93%
kld-Qwen3.5-27B-Q8_0.log:Mean    KLD:   0.001297 ±   0.000857
kld-Qwen3.5-27B-Q8_0.log:Maximum KLD:   8.743946
kld-Qwen3.5-27B-Q8_0.log:99.0%   Δp:  1.564%
kld-Qwen3.5-27B-testing-IQ4_K.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.85%
kld-Qwen3.5-27B-testing-IQ4_K.log:Mean    KLD:   0.006994 ±   0.001986
kld-Qwen3.5-27B-testing-IQ4_K.log:Maximum KLD:  20.218706
kld-Qwen3.5-27B-testing-IQ4_K.log:99.0%   Δp:  5.227%
kld-Qwen3.5-27B-testing-IQ5_K.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.83%
kld-Qwen3.5-27B-testing-IQ5_K.log:Mean    KLD:   0.007172 ±   0.003007
kld-Qwen3.5-27B-testing-IQ5_K.log:Maximum KLD:  25.653091
kld-Qwen3.5-27B-testing-IQ5_K.log:99.0%   Δp:  4.244%
kld-Qwen3.5-27B-smol-IQ4_KSS.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.72%
kld-Qwen3.5-27B-smol-IQ4_KSS.log:Mean    KLD:   0.012258 ±   0.002887
kld-Qwen3.5-27B-smol-IQ4_KSS.log:Maximum KLD:  25.531702
kld-Qwen3.5-27B-smol-IQ4_KSS.log:99.0%   Δp:  6.415%
kld-Qwen3.5-27B-smol-IQ4_KT.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.76%
kld-Qwen3.5-27B-smol-IQ4_KT.log:Mean    KLD:   0.008338 ±   0.001371
kld-Qwen3.5-27B-smol-IQ4_KT.log:Maximum KLD:  13.889874
kld-Qwen3.5-27B-smol-IQ4_KT.log:99.0%   Δp:  6.159%
kld-Qwen3.5-27B-smol-IQ4_NL.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.75%
kld-Qwen3.5-27B-smol-IQ4_NL.log:Mean    KLD:   0.008401 ±   0.001727
kld-Qwen3.5-27B-smol-IQ4_NL.log:Maximum KLD:  17.546616
kld-Qwen3.5-27B-smol-IQ4_NL.log:99.0%   Δp:  6.049%
kld-Qwen3.5-27B-unsloth-Q4_K_M.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.77%
kld-Qwen3.5-27B-unsloth-Q4_K_M.log:Mean    KLD:   0.009447 ±   0.001670
kld-Qwen3.5-27B-unsloth-Q4_K_M.log:Maximum KLD:  16.969128
kld-Qwen3.5-27B-unsloth-Q4_K_M.log:99.0%   Δp:  6.628%
kld-Qwen3.5-27B-unsloth-UD-Q5_K_XL.log:Cor(ln(PPL(Q)), ln(PPL(base))):  99.88%
kld-Qwen3.5-27B-unsloth-UD-Q5_K_XL.log:Mean    KLD:   0.004227 ±   0.001534
kld-Qwen3.5-27B-unsloth-UD-Q5_K_XL.log:Maximum KLD:  15.644795
kld-Qwen3.5-27B-unsloth-UD-Q5_K_XL.log:99.0%   Δp:  3.990%
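If anyone wants to wrangle logs like these into something sortable, here is a throwaway parsing sketch (not part of llama.cpp; the sample lines are copied from the grep output above):

```python
import re

# Parse lines of the form "kld-<quant>.log:<stat>: <value> [± <err>]"
# into {quant: {stat: value}} so quants can be ranked by any statistic.
sample = [
    "kld-Qwen3.5-27B-Q8_0.log:Mean    KLD:   0.001297 ±   0.000857",
    "kld-Qwen3.5-27B-smol-IQ4_NL.log:Mean    KLD:   0.008401 ±   0.001727",
    "kld-Qwen3.5-27B-smol-IQ4_NL.log:99.0%   Δp:  6.049%",
]
stats: dict = {}
for line in sample:
    m = re.match(r"kld-(.+)\.log:(.+?):\s+([\d.]+)", line)
    if m:
        quant = m.group(1)
        stat = " ".join(m.group(2).split())  # collapse padding spaces
        stats.setdefault(quant, {})[stat] = float(m.group(3))

# lowest mean KLD first -> Q8_0 comes out on top here
ranked = sorted(stats, key=lambda q: stats[q]["Mean KLD"])
assert ranked[0] == "Qwen3.5-27B-Q8_0"
```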

For comparison, here is a preliminary sweep on a small custom dataset (8 chunks, -c 4096); the validation corpus is chat-wrapped (Qwen3.5 format with <|im_start|> tokens):
BF16: Final estimate: PPL = 6.2951 +/- 0.13521
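For reference, the `+/-` on these PPL lines is the standard error of the mean negative log-likelihood propagated through the exponential. A minimal sketch of that computation (my own reconstruction, not llama.cpp's exact code):

```python
import math

def ppl_with_stderr(token_logprobs):
    """Perplexity as exp(mean NLL), with the ± uncertainty taken as the
    standard error of the mean NLL pushed through exp() via the delta
    method (d/dx exp(x) = exp(x))."""
    nll = [-lp for lp in token_logprobs]
    n = len(nll)
    mean = sum(nll) / n
    var = sum((x - mean) ** 2 for x in nll) / (n - 1)
    stderr = math.sqrt(var / n)
    ppl = math.exp(mean)
    return ppl, ppl * stderr

# toy logprobs: a constant logprob of -1 per token gives PPL = e exactly,
# with zero variance and hence zero uncertainty
ppl, err = ppl_with_stderr([-1.0] * 100)
assert abs(ppl - math.e) < 1e-9 and err == 0.0
```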

[plots: ppl_plot_Qwen3.5-27B, kld_plot_Qwen3.5-27B]

I'm computing the logits for a new run with wikitext2 (plain text); I'll see if it's coherent.

edit: those results were taken after the latest wave of quant updates; only lmstudio has yet to fix theirs. I haven't included DevQuasar since not only have they not updated their quants, but one of them is MXFP4 (which results in a Q8_0 when the model is not an MoE). I haven't included dinerburger either, since that quant is relatively massive (20.2 GB).

Well, mid-test the results are already so different trend-wise between wikitext2 and the tidy-tiny validation dataset that I'll simply discard this one.

I'll redo everything with a larger ~50-chunk chat-wrapped dataset and include the 72-chunk wikitext2 as a comparison too.

Local. I am putting together a file to let people check for themselves.
No funny math here, just a straight check to see how different things are.

If I can, I'll do!

https://github.com/ggml-org/llama.cpp/issues/21532

The HTML file with the verification setup is in the issue.

fwiw regarding the gemma-4 models, I still haven't quantized them because of what seem to be various tokenizer issues affecting imatrix calculation. So I don't want to release quants and then have to redo them because I need to recompute the imatrix.

this comment on a recent PR highlights that the perplexity results have changed again: https://github.com/ggml-org/llama.cpp/pull/21500#issuecomment-4193534949

@espen96

I looked at your GH issue there; it might be useful to include the full llama-server ... command, though I do see you mention it uses a fixed seed, which is good.

Not sure which version of llama.cpp was used to create your test quants; you might want to link the exact quant used in your test along with that information as well (e.g. whether it was made with an imatrix, etc.).

yea, seems to be an unsloth issue perhaps. bartowski's quants from after he uploaded his fixes (after the main template and tokenizer changes) seem to be fine.
My unsloth quants are from when they pushed their updates and seem to be borked, so from the 4th.

Seems the official ggml Q4 is fine as well. The launch-day BF16 from unsloth is also affected.
No clue about their current quants. They keep pushing updates without any changelog; who knows why or what changed, or when to grab new ones.
Is it "template fixes"? A new imatrix? A new recipe? Who knows!

Yea, it would be quite the blunder if my setup didn't use a fixed seed.

ok, they found it. It's an issue with GEMV fusion.

https://github.com/ggml-org/llama.cpp/pull/21566

So all the BF16 and F16 quants I tried were affected: Unsloth, ggml, Bartowski. But naturally I only had one around to start; I grabbed Bartowski's after the F16 quant from ggml failed, to rule out issues with the format.
However, all quants appear to be working now; at least the ones I checked to verify the fix seem to be fine. Same result every time.

No way I can ever measure the 31B Gemma 4 myself.

But:
https://localbench.substack.com/p/gemma-4-31b-gguf-kl-divergence
this seems like a very good writeup, though I do wish they had more categories. They cover:

Coding
General chat
Tool calling
Science
Non-Latin scripts
Long documents

Great list, but I would have liked to see some pop culture stuff with a focus on easier and harder data. Still, this is great; they even got long documents! I have not gotten that far yet.
They've got top-1 and KLD. I would have liked to see more than just the top choice; LLMs are not greedy, so you've got to pay attention to the top options. Top-1 agreement is no good if the model is considering multiple tokens strongly.
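That "more than just the top choice" idea could be measured directly with a top-k set overlap instead of top-1 agreement. A hedged sketch, with hypothetical function names:

```python
import numpy as np

def topk_overlap(logits_base, logits_quant, k=5):
    """Fraction of the baseline's top-k candidate set that survives in
    the quant's top-k -- a softer check than same-top-token agreement,
    for cases where the model is weighing several tokens strongly."""
    top_b = set(np.argsort(logits_base)[-k:])
    top_q = set(np.argsort(logits_quant)[-k:])
    return len(top_b & top_q) / k

base = np.array([5.0, 4.9, 4.8, 0.1, 0.0, -1.0])
# a small perturbation reorders the top tokens but keeps the same set,
# so top-1 agreement fails while top-3 overlap is still perfect
pert = np.array([4.9, 5.0, 4.7, 0.2, 0.1, -1.0])
assert topk_overlap(base, pert, k=3) == 1.0
```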

So it does appear that the 31B, according to this, is quantizing as expected: everything is mostly fine until you drop into the knee where the Q4 quants start to reveal the damage, and it just gets worse from there.

I had a feeling about something.

remember the 100% divergences?

here is some data on Gemma:

[image: Gemma divergence data]

Qwen 35B:

[image: Qwen 35B divergence data]

Ministral 8B:

[image: Ministral 8B divergence data]

Llama 3.1 8B:

[image: Llama 3.1 8B divergence data]

Note one thing: the replacement tokens appear to come from some unknown place, a complete stranger.
For Qwen and Gemma, it is often a token with a very, very high probability.

This is using the "kneedle" algorithm to find the cutoff.
So this is saying that, most of the time, when the model's own natural boundary (the point where it stops considering tokens) is breached, we are picking some "random" token the model never considered before.

And for Gemma and Qwen, the rank of the candidate we just replaced with some random token is the top one.
It is not replaced by the runner-up, or some other top-10 choice; no, by some outsider.
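For context on the Kneedle reference above: the idea is to locate where a sorted probability curve bends from "tokens the model is considering" into the negligible tail. A simplified stand-in for that kind of cutoff (max distance below the chord, which is the core Kneedle idea, though not the exact implementation used here):

```python
import numpy as np

def knee_cutoff(probs):
    """Index of the 'knee' in a descending-sorted probability list: the
    point farthest below the straight line joining the first and last
    values. A simplified stand-in for the Kneedle algorithm."""
    p = np.asarray(sorted(probs, reverse=True), dtype=float)
    x = np.linspace(0.0, 1.0, len(p))
    line = p[0] + (p[-1] - p[0]) * x   # chord from first to last point
    return int(np.argmax(line - p))    # largest drop below the chord

# three tokens the model "considers", then a long negligible tail;
# the knee lands at index 3, the first tail token
probs = [0.55, 0.30, 0.10, 0.002, 0.001, 0.001, 0.001]
assert knee_cutoff(probs) == 3
```

Everything past the knee is the tail the model effectively never considered, which is what makes a 100%-divergence replacement token landing out there so strange.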
