Insight into the "weird" data.
So you mentioned that the data was weird and not well behaved? Would you mind elaborating on what happened?
This model is quite unusual, being a dense SSM/transformer hybrid. While everyone is doing a ton of research on the 35B model, I am really curious what we might learn about this special one.
Especially as we might expect even more of them when the smaller 3.5 models release.
A couple of quants have PPL below baseline. The Q8_0 has PPL slightly below BF16. There are multiple KLD / cross-correlation / Δp statistics which are not always correlated with the PPL etc.
Often on a big MoE the PPL is more correlated with KLD and there is a clear, more or less monotonic decrease in quality with size.
I've analyzed quants mostly in the 4~5 bpw range and they are all somewhat comparable in general, though some appear slightly better. But the lowest-PPL quants are not the lowest-KLD ones, etc.
I may post some logs or data after digging a bit more, but for now I just released the smol-IQ4_NL as some folks have interest in that type of quantization, e.g. @Lockout and maybe @tarruda
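For intuition on how PPL and KLD can disagree like that, here's a toy sketch (the numbers are made up, nothing measured from these models): PPL only scores the probability the quant assigns to the ground-truth token, while KLD compares the whole distribution against the baseline, so a quant can sharpen toward the "correct" token (lower PPL) while distorting the rest of the distribution (higher KLD).

```python
import math

def ppl_and_kld(base_dists, quant_dists, targets):
    """Toy per-token PPL and mean KLD.
    base_dists / quant_dists: one {token_id: prob} dict per position.
    targets: the ground-truth next token at each position."""
    n = len(targets)
    # PPL only uses the quant's probability of the correct token
    nll = -sum(math.log(quant_dists[i][targets[i]]) for i in range(n)) / n
    # KLD compares the full distributions, correct token or not
    kld = sum(p * math.log(p / quant_dists[i][t])
              for i in range(n)
              for t, p in base_dists[i].items()) / n
    return math.exp(nll), kld

base      = [{0: 0.6, 1: 0.4}]   # BF16 baseline distribution
sharpened = [{0: 0.7, 1: 0.3}]   # closer to the target, further from base
faithful  = [{0: 0.58, 1: 0.42}] # further from the target, closer to base
print(ppl_and_kld(base, sharpened, [0]))  # lower PPL, higher KLD
print(ppl_and_kld(base, faithful, [0]))   # higher PPL, lower KLD
```

The "sharpened" quant wins on PPL yet loses on KLD, which is exactly the kind of inversion showing up in these logs.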
hm, interesting. Your quant so far seems to be behaving as expected at least.
On my pop culture "token flip / confabulation" prompts they do not seem to be much worse for wear than q6 even.
I might note that while the 122B and 27B appear similar in many benchmarks, the 122B model is significantly better at dealing with Medium and Long tail knowledge.
It soundly beats them both, with and without reasoning.
It was also way better at writing GLSL shaders. The 35B model at Q8_0 was soundly beaten by the 122B model at IQ4_XS, and the 27B was not as good, but much closer to the 122B than the 35B, while itself running UD Q4_K_XL at the time of the test.
My benchmark is sadly not yet ready for a proper test, having issues with generating the datasets. But I will certainly check to see if the probability distributions towards the factual answers differ in any significant way.
Measuring quants might be a bit beyond it without a full and diverse test set however...
All in all... I am glad we are starting to see some new, fresh data. All the old conventional wisdom is built on pure transformer dense models, it's about time we are seeing some public facing data on these new architectures eh?
Hey @ubergarm!
I don't have the resources to generate the .kld baseline for the BF16. Do you plan to release the KLD values for Q8_0 and your smol-IQ4_NL to compare? I was waiting for your jump in, expecting more numbers :)
I also notice that somehow PPL is better without imatrix. And even a 4.165bpw quant that can fit into 16GB VRAM with good context can achieve a PPL of 6.8931 +/- 0.04448, and actually work well in real tasks.
I've been out over the weekend and also waiting to see how things fall out after the unsloth qwen3 shakeup.
Here is some raw data from KLD stats on this model for various quants from last week (before the UD shakeup, assuming it affected the dense model). Many of these are just test quants I never released.
$ grep -E '(Cor|Mean.*KLD|Max.*KLD|99.0.*Δp)' kld*.log
kld-Qwen3.5-27B-bartowski-Q5_K_M.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.85%
kld-Qwen3.5-27B-bartowski-Q5_K_M.log:Mean KLD: 0.005931 ± 0.002823
kld-Qwen3.5-27B-bartowski-Q5_K_M.log:Maximum KLD: 23.893120
kld-Qwen3.5-27B-bartowski-Q5_K_M.log:99.0% Δp: 3.394%
kld-Qwen3.5-27B-derp-Q4_0.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.75%
kld-Qwen3.5-27B-derp-Q4_0.log:Mean KLD: 0.009037 ± 0.001811
kld-Qwen3.5-27B-derp-Q4_0.log:Maximum KLD: 18.391321
kld-Qwen3.5-27B-derp-Q4_0.log:99.0% Δp: 6.682%
kld-Qwen3.5-27B-IQ5_K.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.84%
kld-Qwen3.5-27B-IQ5_K.log:Mean KLD: 0.006820 ± 0.002943
kld-Qwen3.5-27B-IQ5_K.log:Maximum KLD: 25.351452
kld-Qwen3.5-27B-IQ5_K.log:99.0% Δp: 3.833%
kld-Qwen3.5-27B-Q4_0.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.77%
kld-Qwen3.5-27B-Q4_0.log:Mean KLD: 0.007423 ± 0.001620
kld-Qwen3.5-27B-Q4_0.log:Maximum KLD: 16.334347
kld-Qwen3.5-27B-Q4_0.log:99.0% Δp: 5.843%
kld-Qwen3.5-27B-Q8_0.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.93%
kld-Qwen3.5-27B-Q8_0.log:Mean KLD: 0.001297 ± 0.000857
kld-Qwen3.5-27B-Q8_0.log:Maximum KLD: 8.743946
kld-Qwen3.5-27B-Q8_0.log:99.0% Δp: 1.564%
kld-Qwen3.5-27B-testing-IQ4_K.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.85%
kld-Qwen3.5-27B-testing-IQ4_K.log:Mean KLD: 0.006994 ± 0.001986
kld-Qwen3.5-27B-testing-IQ4_K.log:Maximum KLD: 20.218706
kld-Qwen3.5-27B-testing-IQ4_K.log:99.0% Δp: 5.227%
kld-Qwen3.5-27B-testing-IQ5_K.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.83%
kld-Qwen3.5-27B-testing-IQ5_K.log:Mean KLD: 0.007172 ± 0.003007
kld-Qwen3.5-27B-testing-IQ5_K.log:Maximum KLD: 25.653091
kld-Qwen3.5-27B-testing-IQ5_K.log:99.0% Δp: 4.244%
kld-Qwen3.5-27B-smol-IQ4_KSS.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.72%
kld-Qwen3.5-27B-smol-IQ4_KSS.log:Mean KLD: 0.012258 ± 0.002887
kld-Qwen3.5-27B-smol-IQ4_KSS.log:Maximum KLD: 25.531702
kld-Qwen3.5-27B-smol-IQ4_KSS.log:99.0% Δp: 6.415%
kld-Qwen3.5-27B-smol-IQ4_KT.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.76%
kld-Qwen3.5-27B-smol-IQ4_KT.log:Mean KLD: 0.008338 ± 0.001371
kld-Qwen3.5-27B-smol-IQ4_KT.log:Maximum KLD: 13.889874
kld-Qwen3.5-27B-smol-IQ4_KT.log:99.0% Δp: 6.159%
kld-Qwen3.5-27B-smol-IQ4_NL.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.75%
kld-Qwen3.5-27B-smol-IQ4_NL.log:Mean KLD: 0.008401 ± 0.001727
kld-Qwen3.5-27B-smol-IQ4_NL.log:Maximum KLD: 17.546616
kld-Qwen3.5-27B-smol-IQ4_NL.log:99.0% Δp: 6.049%
kld-Qwen3.5-27B-unsloth-Q4_K_M.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.77%
kld-Qwen3.5-27B-unsloth-Q4_K_M.log:Mean KLD: 0.009447 ± 0.001670
kld-Qwen3.5-27B-unsloth-Q4_K_M.log:Maximum KLD: 16.969128
kld-Qwen3.5-27B-unsloth-Q4_K_M.log:99.0% Δp: 6.628%
kld-Qwen3.5-27B-unsloth-UD-Q5_K_XL.log:Cor(ln(PPL(Q)), ln(PPL(base))): 99.88%
kld-Qwen3.5-27B-unsloth-UD-Q5_K_XL.log:Mean KLD: 0.004227 ± 0.001534
kld-Qwen3.5-27B-unsloth-UD-Q5_K_XL.log:Maximum KLD: 15.644795
kld-Qwen3.5-27B-unsloth-UD-Q5_K_XL.log:99.0% Δp: 3.990%
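For anyone wanting to tabulate that grep output rather than eyeball it, a small parsing sketch (the filename/metric patterns just match the log lines above; the helper names are mine):

```python
import re

def parse_kld_grep(lines):
    """Turn `grep ... kld*.log` output like
    'kld-Qwen3.5-27B-Q8_0.log:Mean KLD: 0.001297 ± 0.000857'
    into {quant: {metric: value}} for sorting/comparison."""
    stats = {}
    for line in lines:
        fname, rest = line.split(":", 1)
        quant = fname.removeprefix("kld-").removesuffix(".log")
        m = re.match(r"\s*([^:]+):\s*([-\d.]+)", rest)
        if m:
            stats.setdefault(quant, {})[m.group(1).strip()] = float(m.group(2))
    return stats

def rank_by_mean_kld(stats):
    """Sort quants by Mean KLD, lowest (closest to baseline) first."""
    return sorted(stats, key=lambda q: stats[q].get("Mean KLD", float("inf")))
```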
For comparison, a preliminary sweep on a small custom dataset (8 chunks, -c 4096); the validation corpus is chat-wrapped (Qwen3.5 format with <|im_start|> tokens).
BF16: Final estimate: PPL = 6.2951 +/- 0.13521
I'm doing the logits for a new run with wikitext2, plain text, I'll see if it's coherent.
edit: those results were taken after the latest wave of quant updates; only lmstudio has yet to fix theirs. I haven't included DevQuasar since not only have they not updated their quants, but one of them is mxfp4 (which results in a Q8_0 when the model is not an MoE). I haven't included dinerburger either since that quant is relatively massive (20.2 GB).
Well, results mid-test are already so different trend wise between wikitext2 and the tidy-tiny validation dataset that I'll simply discard this one.
I'll redo everything with a larger ~50 chunks chat-wrapped dataset and include the 72 chunks wikitext2 as a comparison too.
Thanks for running some numbers and sharing your results including some of your methodology! A few thoughts:
- With llama.cpp the results change with the context length; the tradition is to use at least 100 chunks at 512 ctx. But as long as you are consistent with your own methodology you can compare relative values against your own results. I have some of my methodology, at least for PPL, mentioned there. The results should be invariant with the batch size at least. I know there is some research going on in the BeaverAI Discord proposing to modify llama-perplexity to be more like turboderp's exllamav3 and vLLM's windowed/strided computation, which may be more invariant to ctx length. We'll see.
- So your graph shows a spread of roughly 6.30~6.36 and the reported noise margin is +/- 0.13521, so at least by perplexity they are all very close. The KLD shows almost two different ranges, with most of them around 0.006 and a few up higher. Do you know if the mradermacher quants here are using imatrix (just curious)? [Or is it that the i1 prefix means imatrix and no i1 prefix means no imatrix?]
- Yes, the results can vary a lot by choosing a different text.
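For reference, the usual two-pass flow with llama-perplexity looks roughly like this (flags from memory, so double-check against --help; model paths and the eval text are placeholders):

```shell
# Pass 1: run the BF16 baseline over the eval text and save its logits
./llama-perplexity -m model-BF16.gguf -f eval.txt -c 512 \
    --kl-divergence-base baseline.kld

# Pass 2: run each quant against the saved baseline logits;
# this prints PPL plus the KLD / Δp / correlation stats seen in the logs
./llama-perplexity -m model-Q4_0.gguf -c 512 \
    --kl-divergence-base baseline.kld --kl-divergence
```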
I'm still digesting what actually changed with the latest crop of UD quants and all that jazz and if anyone is noticing any differences between the old ones etc. Everything is moving fast!
Cheers!
That's correct, for mradermacher quants i1 means imatrix and no i1 means no imatrix.
They have that note on every model card, clearly marking weighted/imatrix vs static quants.
100 chunks at 512ctx, got it. I'll do that next time, I'm def not against shrinking the error bar.
Wikitext2 is done but it's still chewing through the 47-chunk custom dataset. All of these runs (the Qwen3.5-27B ones, I mean) use a context of 4096, so at least they are comparable in that regard.
I'll be posting a "barbell" plot here, so we get to see the difference between a "not so boring" chat-templated dataset and a plain-text English one (it won't say much about the quants themselves, but about the recipe's success I guess).
I got both mradermacher quant types with the one using imatrix noted "i1" as he named them.
As for whether anyone is noticing any differences: I mean, it's largely vibes these days, and for the few percent who noticed, I hope they'll consider benchmarking and sharing their results, as I don't like hype based solely on figures published by the labs.
Thanks for the input.
@ubergarm thanks for sharing your numbers! I'm starting to think it's quite hard to get the whole picture over all this... The calibration dataset, the perplexity dataset, the resulting PPL and KLD, the usual benchmarks, the real world cases... That's a lot! That was easier when we knew less :D
That said I'm still eager and curious to see what the new runs of @cmh will bring!
Offtopic but artificialanalysis just added the results for Qwen3.5-27B in non-reasoning mode... Crazy to see it still crushes gpt-oss-120B (high) and is on par with Qwen3.5 35B A3B in reasoning mode... This release is such a gap... And I don't know how much you guys used it so far, but I personally really feel the difference, it's blatant. Yesterday I was looking at my model library... And thought "well, I can't actually justify to keep all the other ones anymore". Kind of a new era here...
Working on a tool myself that should help. Generally it reframes the question from KL and perplexity to "potential for something to go wrong".
Events when generating an answer where the nucleus of the probability mass is concerning, with estimates of the impact. I'll share some previews of what I see so far, mainly using Unsloth XL quants and the 35B model.
This is generating 256-token continuations over 20 prompts (parts of various Wikipedia sections), then checking the difference at any given token vs the baseline, greedy path.
Red is a reduction of probability, blue an increase, and a border means that the top token shifted.
Now I am chewing through the data to be able to categorize by domain, and how often the error is of a certain nature, or tier.
Tier 0 — Invisible Shift
The output is unchanged, but the probability landscape has moved. The model arrived at the same token by a narrower margin. No visible effect here, but the distribution is subtly different from what a full-precision model would produce — elevated risk that compounds silently over time.
Tier 1 — Precision Loss
The output changes, but not obviously. A technical term becomes a softer synonym. A precise date becomes approximate. A confident claim becomes hedged. Nothing is wrong, but something is less right. The model feels dumber. It is hard to point to, easy to feel across a long conversation or document.
Tier 2 — Isolated Error
A specific fact, name, date, or definition is wrong. The error is real but contained, the classic "hallucination" event. The model moves on without building on it. Individually visible if you know the correct answer; invisible if you don't.
Tier 3 — Cascade
A wrong token becomes an anchor. The model is self-consistent by nature, so it builds coherently on the wrong foundation. One wrong year becomes a rewritten timeline. One wrong name becomes a different person's story. The cascade could last a sentence or the rest of the generation. The entry point can be spotted, or presumed, but without following that path there's no telling what might happen.
Then the final "usable", easily understood metric for people will be a Fermi estimate of these values: the general impact of quantization, beyond some number that means nothing to them. Per 1K tokens, how many possible events of each type? Based on the general domain or topic, how is it for code, etc.
I am also going to use it on quantized KV vs unquantized at longer prefills, like summaries at 64K, listing the characters in a story, writing code and so on.
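As a back-of-the-envelope sketch of that per-1K-token tier framing (the function names and thresholds here are made up for illustration, not the actual tool):

```python
def tier_of(base_top, quant_top, delta_p, tol=0.02, flip_is_factual=False):
    """Rough tiering of one token comparison vs the baseline greedy path.
    None: distribution shift below the noise threshold (no event).
    0: same top token, but the landscape moved underneath (invisible shift).
    1: top token flipped to a near-equivalent (precision loss).
    2: top token flipped on a factual token, e.g. a name/date (isolated error).
    Tier 3 (cascades) needs sequence-level tracking, not shown here."""
    if base_top == quant_top:
        return 0 if abs(delta_p) >= tol else None
    return 2 if flip_is_factual else 1

def events_per_1k(tiers, total_tokens):
    """Fermi estimate: events of each tier per 1K generated tokens."""
    counts = {}
    for t in tiers:
        if t is not None:
            counts[t] = counts.get(t, 0) + 1
    return {t: 1000 * n / total_tokens for t, n in counts.items()}
```

So, say, 3 factual top-token flips observed across 6,000 generated tokens would report as 0.5 tier-2 events per 1K tokens, which is the kind of number a non-statistician can act on.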

Here at Q8_0 KV the model is losing confidence that the text is "At the Mountains of Madness" while staring at a 50K context of the entire novella.
The visualization is such a great feature, you can probably guesstimate on a multilingual text if a language is impacted for example. Very Neat.
Okay, the dumbbell plot isn't the most legible plot, anyway
edit: available on https://www.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/
You're welcome.
Indeed @owao, I think at lower contexts it is probably fine? At BF16 the difference stands out, but the probabilities are already "disturbed" at Q8 and below.
I want to check it at 32, 64, 82 and 128K:
relative comparisons between the "clean" quant and its quantized-KV counterpart.
I will probably focus on Q8-Q4 for that.
agentic work will often require a lot of prefill, it seems? checking at low ctx is quite useless.
I might also see if I can suppress EOS and force the model to write a very... very long story... that gives us a good idea of when and how the KV starts to deteriorate under quantization.
I have actually been wondering why, other than convenience, KL and perplexity (or even the imatrix) are never generated against the model's own output.
The imatrix set might make it pick bad tokens if not done right? But I am thinking that the goal is to preserve the quality of the baseline, right? The model should thus be tested against its actual self? And the imatrix be generated against its own preferences? But I am not an expert on quantization; it might be bad for a variety of reasons.
I am just some dude who's annoyed enough at the lack of testing on everyday stuff, outside of code and such, typical benchmarks.
The models have the edge there even when it comes to web data. All the pop culture and "casual" data is a much smaller part of the training set than you'd think.
And quant measurements like KLD are great for statistics, but lack context. But hey, that part is coming along real fast.
@cmh it was a good idea to show the delta value between the 2 datasets! And this seems to reflect the effort Unsloth claimed to have made https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-3-perplexity-and-kld-can-be-misleading to alleviate the calibration/evaluation "contamination" issue. And we can see @ubergarm's method did a great job too :)
relative comparisons between the "clean" quant and its quantized KV counterpart
I've been wondering for a long time too! I also always wonder how bad KV quantization is for MoE vs dense, and small vs large models. I think this depends on too many factors to end up with general rules, but it might still be really interesting to have a tool to get those results for a specific model, so that we can run inference with full knowledge of the facts!
I might also see if I can supress eos, and force the model to write a very... very long story.
Why not using the base, pretrained model instead of the posttrained one? Sorry if my question is dumb
why, other than convenience KL and Perplexity, or even imatrix is never generated against its own output
I didn't get what you mean for KL. Don't we precisely compare it to the baseline BF16 model?
But I am thinking that the goal is to preserve the quality of the baseline right? the model should thus be tested against its actual self? And imatrix be generated against its own preferences?
I think it's actually the flow, at least thats how I see it! But I still have to learn how this works, to me the imatrix part is still a magic trick so far. But now I'm finally starting to use those calibrated models I'll have to understand at least a bit what I use!
@owao
the base model still has some tuning, and it will still stop. the IT model will generally be able to continue for a very long time if it must. worst case, it is simply told or encouraged to continue.
I mean that the KLD is usually measured against something like wikitext, just like perplexity.
You have your dataset, you do what you would as if it was perplexity, but you also store the distributions; that is what's used to measure KLD.
I am asking about that dataset: instead of wikitext etc., measure against the model's own greedy output, not some "random" text.
⚠ NOT FOR LEARNING ⚠
measure against the models own greedy output, not some "random" text.
To me it's what we are doing for the KLD part, but maybe I misunderstand it. To me:
- you take your baseline BF16 model
- you feed it in parts of wikitext for example
- you compute how well it predicts the next tokens compared to ground truth
- it gives you the PPL
- you also store the logits (tokens + their probabilities) for this BF16 model (.bin or .kld file)
- you do redo the PPL compute for the quant
- compare it to the BF16 logits
- it gives you the KLD
From what I understand, PPL step is to assess the language understanding capabilities of the model, and KLD assesses the difference between the predictions of the baseline vs the quant (regardless that the predictions were good --> low PPL or bad --> high PPL).
So to me, to calculate the PPL you absolutely need some "ground truth" text. To calculate the KLD, we could instead feed it any random tokens rather than text from a dataset; maybe that's what you mean? But if it is greedy decoding anyway, what's the difference between starting from an existing text instead of "noise"? At the end we just want tokens and probabilities. I guess we just need a dataset for the PPL, but once we've got the logits, we reuse them to save some power? Hahaha, instead of regenerating logits from nothing.
All this is based on my current model of the thing but I might be totally wrong!
I'll feed this entire message to Qwen3.5-27B tomorrow so it can tell how bad it is and teach me :D
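The bullet-point pipeline above, shrunk to toy scale (the model callables here are hypothetical stand-ins that return a {token: prob} dict; real tools work on raw logits):

```python
import math

def save_baseline(base_model, tokens):
    """Pass 1: the BF16 model's full next-token distribution at each
    position -- conceptually what the .kld/logits file stores."""
    return [base_model(tokens[:i]) for i in range(1, len(tokens))]

def eval_quant(quant_model, tokens, base_dists):
    """Pass 2: PPL needs the ground-truth text; KLD only needs the
    saved baseline distributions."""
    n = len(tokens) - 1
    nll = kld = 0.0
    for i in range(1, len(tokens)):
        q = quant_model(tokens[:i])                  # quant's distribution
        nll += -math.log(q[tokens[i]])               # scores the true token
        p = base_dists[i - 1]                        # stored BF16 distribution
        kld += sum(p[t] * math.log(p[t] / q[t]) for t in p)
    return math.exp(nll / n), kld / n
```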
Right, but the baseline is wikitext or whatever.
Not its own responses.
The Perplexity and KLD is using Wikitext or some other dataset.
If you had asked the model to tell you about Apollo 11 and so on, and used those responses as the dataset.
It's the difference between checking how good the model is at predicting the source text you fed it. Wikitext, for instance, is bad because the model has probably seen it.
If you use the model's own outputs (responses to queries, reasoning chains, etc.) you're not asking if the model is good at predicting some external text and how that changed; you're checking if it still produces the same answer to the same query.
Yeah I get what you mean, but I don't think whether the dataset was present in the training data actually matters. Whether you ask it to continue a text it has seen or answer a question it never saw anywhere before, it still basically just infers from the context based on what it learned. All that matters is what the next tokens and their probs are. For wikitext that might be high probs for the top-p tokens while for a new topic it might be more distributed, but all that counts is getting the data; as long as for both setups you take the same top-p range and enough precision, both might be equivalent. But that's once again just how I see it as of now. Still need to learn many of the basics! So chances are I'm totally wrong and a math guy is tearing their hair out reading this. I think I'm testing this theory where, to get the answer to a question, you say shit online and will always end up with a specialist who can't let you say shit, and so in the end you get your answer 😆 I first thought it was bad morally, but if in the end it serves everyone, maybe it's worth it 😅
Could be, could be. I am not an expert either. So we can only try things and learn from it.
My intuition tells me one method is more "relevant" to the goal? But I am not an expert. I think there are a lot of people in the community who also learned by trying stuff rather than understanding the math and the theory and properly understanding this technology. If the dataset is big enough, it probably ends up a wash? If the dataset is not "tainted", at least.
Where are the experts that don't let people be wrong on the internet when you need them haha.
I think it's good to put thoughts and ideas out there, as long as it is clear that they could be wrong, and one is willing to learn from corrections from actual experts or rebuttals.
Where are the experts that don't let people be wrong on the internet when you need them haha.
They all had these exact same kinds of arguments two years ago after ik implemented imatrix and the newer quantization types in mainline llama.cpp:
- https://github.com/ggml-org/llama.cpp/discussions/5263
- https://github.com/ggml-org/llama.cpp/discussions/5063
Also unsloth is still re-re-uploading quants I guess, but there's an amusing recent reddit post with their 99.9% KLD data (though mean KLD might be a better comparison...): https://www.reddit.com/r/LocalLLaMA/comments/1rlkptk/comment/o8t0wg6/
Could be, could be. I am not an expert either. So we can only try things and learn from it.
My intuition tells me one method is more "relevant" to the goal? but I am not an expert. I think there are a lot of people in the community that also learned from trying stuff rather than understanding the math and the theory and properly understands this technology.
That's pretty much what I've heard: this field, complex and multi-disciplinary, is like building a plane and figuring out how it flies after the fact, so people tend to learn machine learning top to bottom because at the bottom there's an endless pit. So I guess it's better to have a project (to show eventually) and learn just enough to move on.
As for the KLD shenanigans: I've updated my post about Qwen3.5-35B-A3B-Q4 when they requantized.
I edited my post at the end: "This benchmark reports Mean KLD, which averages divergence across all tokens. Unsloth's graphs use 99.9% KLD (the tokens where the quant diverges most from BF16). Both are valid but measure different things: Mean KLD gives an overall quality signal, while 99.9% KLD is more sensitive to catastrophic individual token failures. They're complementary."
I've put up both the old and the new quants for people to compare, since everything's about vibes these days. It's not a passive-aggressive move; I just don't want people with old quants wondering why the model is bad on their end.
edit: oh, they also contacted me (not u/danielhanchen, who's probably pissed, but u/yorascale)... after I put up their new quant, to ask for a "redo".
I like unsloth, but they obsess a bit too much over quant release time. I personally think it's better to just wait at least a good week before downloading anything when it's a new arch, just for things to get ironed out. Then ofc test them.
Thanks for the links!
yea, I really question that shift to F16 over BF16... that could seriously backfire.
by the way. since you're doing new quants.
I find their UD-IQ4_XS to be a good size for a 3090 + 64 GB of RAM on Windows, give or take a few gigs at most. However, it is slow as all hell; sadly, ik has not been able to outperform mainline in inference speed for me so far. It might be worth checking if one can get a quant around 55-65 GB. The Q4_K_S quant they have runs fine, but lands at 66.7 GB, a bit tight for just some beefed-up workstation. The quant you have right now did not, but that might also be ik_llama and Windows not being friendly.
A mainline-friendly quant that targets that range might find a lot of use for people without a dedicated LLM rig: good quality, and enough room to run it without hitting memory limits.
I edited my post at the end: "This benchmark reports Mean KLD, which averages divergence across all tokens. Unsloth's graphs use 99.9% KLD (the tokens where the quant diverges most from BF16). Both are valid but measure different things: Mean KLD gives an overall quality signal, while 99.9% KLD is more sensitive to catastrophic individual token failures. They're complementary."
Yea, the outliers matter, but it is also very good to know how it does in general. It's a communication issue, I think. It's why I started my little project. It's context. People see that the values are very close to the baseline, and without any idea what that means it's kinda useless or misleading to many.
How good is the quant in general? and how bad does it get? That's what we got now, and it is so important to have it.
And what does bad mean? What's the risk? That one is what I am working on I suppose.
Edit: yea, they rush. I get that, but when they keep re-uploading... it's a hot mess.... they should spend the time to get it right in the first place.
I find "we took the time to really figure this model out" to be a better goal than "we got there first! now hold on as we fix all the issues over the next few weeks..."
yea, I really question that shift to F16 over BF16... that could seriously backfire.
I had a whole discussion with noctrex and friends about how, if you want to downcast the original BF16 to FP16, you must confirm that no weights exceed ±65504 (the FP16 max), otherwise you'll end up clipping values to fit them into the smaller number of exponent bits and reduced dynamic range... sigh
It could be okay, but I haven't seen if they confirmed by writing a script to check all the weights first...
Besides, I don't recommend leaving any tensors in the quantized models at full 16 bpw (neither BF16 nor FP16), as Q8_0 is half the size and 99.5%+ quality, probably without clipping worries... anyway haha...
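The clipping check mentioned above is cheap to sketch with numpy (treating the checkpoint as a dict of arrays; a real script would iterate over a safetensors file instead, and this is just my illustration, not a script anyone here actually ran):

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0

def fp16_clip_report(weights):
    """weights: {tensor_name: array of BF16 values upcast to float32}.
    Returns tensors whose max magnitude exceeds the FP16 range, i.e.
    values that would clip (or overflow to inf) on a BF16 -> FP16 cast.
    BF16 keeps float32's 8 exponent bits; FP16 only has 5."""
    return {name: float(np.abs(w).max())
            for name, w in weights.items()
            if np.abs(w).max() > FP16_MAX}
```

An empty report means the downcast is at least range-safe (precision of small values is a separate question).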
3090 + 64gb of ram on Windows. give or take a few gigs at most. However, it is slow as all hell. sadly, IK has not been able to outperform mainline so far for me in inference speed.
Which windows builds are you using? The https://github.com/Thireus/ik_llama.cpp/releases/ or the recent docker containers by https://github.com/Steel-skull/ik_llama.cpp/pkgs/container/ik_llama.cpp ?
The mainline llama.cpp Qwen3.5 CPU implementation is very slow due to delta net. ik has a good chunked delta net, so it's likely the way to go. If you can switch to Linux for your inference rig I'd recommend it (or at least dual boot?)
A mainline-friendly quant that targets that range might find a lot of use for people without a dedicated LLM rig: good quality, and enough room to run it without hitting memory limits
Does AesSedai have anything in that range for you?
Which windows builds are you using? The https://github.com/Thireus/ik_llama.cpp/releases/ or the recent docker containers by https://github.com/Steel-skull/ik_llama.cpp/pkgs/container/ik_llama.cpp ?
I build IK_llama directly myself. I haven't tried any premade builds.
The mainline llama.cpp Qwen3.5 CPU implementation is very slow due to delta net. ik has a good chunked delta net, so it's likely the way to go. If you can switch to Linux for your inference rig I'd recommend it (or at least dual boot?)
I used to have a Linux drive, but that NVMe SSD recently died. Sadly they are not cheap to replace right now... I'll be looking to grab something as soon as it is feasible to do so.
Does AesSedai have anything in that range for you?
Sadly no. Their quants are either as slow as the unsloth I-quants, or a full K_M size, which is about 75 GB? So no, nobody has anything that really hits that market. A normal Q4_K_S is already 72 GB, and even a normal IQ4_XS is 65 GB. So we're probably after something like your 35B model, using legacy quants and K-quants. By process of elimination I can only say that the I-quants run 5+ t/s slower: the difference in speed between the UD-IQ4_XS and the Q4_K_S from unsloth is about 5 t/s in favor of the bigger quant. UD-IQ4_NL runs just as badly as the UD-IQ4_XS.
I would attempt to quant it myself but... I literally do not have the storage for it, especially not an SSD; see the aforementioned sudden death of my Linux NVMe.
Just found this evaluation specific to the 27B, sadly yours @ubergarm wasn't tested, still useful
https://x.com/bnjmn_marie/status/2029227800574447958
bartowski IQ4_NL is 15.9 GB
unsloth Q4_K_XL is 17.6 GB
Seems bartowski won this one
And I didn't expect this (because I often read that I-quants were slower at inference compared to K-quants), but here I'm getting 29.5 t/s over 512 tokens for Q4_K_XL versus 32.1 t/s for IQ4_NL.
Just found this evaluation specific to the 27B, sadly yours @ubergarm wasn't tested, still useful
https://x.com/bnjmn_marie/status/2029227800574447958
bartowski IQ4_NL is 15.9 GB
unsloth Q4_K_XL is 17.6 GB
Seems bartowski won this one
Nothing against this person, but I would prefer more details; it is a bit frustrating. Maybe some of those benches are completely saturated and only one or two are actually relevant. Also "relative errors" can be deceiving.
The most critical part would be the eval environment: we don't know the temp, or whether a regex is wrong, or if the chat template has been correctly implemented. And if the eval doesn't require one, is it really reflective of the model's performance when this audience is expecting reasoning capabilities?
This is not hard to run, but a slight error in one of the config files can drastically change the results (even if everything is tested in the same conditions), so it requires transparency (I mean, not a jpg on X and a paywall).
Again, idk who this is and they're probably more qualified than me, but damn, those tests seem superficial. So I'd take those results with a cubic meter block of salt.
Influencers gonna influence.
It's pretty sad when for example unsloth is going to use those figures as a justification for their mistakes.
https://www.reddit.com/r/LocalLLaMA/s/gihC3Xf3f9
Edit: I'll do the same kind of bench myself; I bet I could make some errors/tweaks so as to obtain different results that would get amplified on a relative-error chart. Sorry for the rant.
Sorry for the rant.
No no, that's legitimate! And I agree the observed results aren't isolating the quality of the quantization itself, but the entire cooked GGUF instead, metadata included. But as a user, I'm happy to get an idea of what to expect when using the "end product" instead of potentially observing different behaviors when I run it. That's what I'm looking for when searching for benchmarks. But that's just me; I get your point, which is fair too!
At least for the chat templates, Unsloth and Bartowski are using the same one (apart from a minor tool-call tweak, the rest is identical):
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF?chat_template=default
https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF?chat_template=default
I'm wondering if running the benchmarks is cheaper than paying $7 😅 But I agree seeing research results behind a paywall is always frustrating. That said, it's not financed by public funds, so I can't really judge if they want to recoup their own effort and costs... But that's just a point of view indeed.
PS: I think their comment there is actually quite fair, and partially answers the contamination/benchmaxxing you pointed: https://x.com/bnjmn_marie/status/2029873413246849155
To be clear, I have nothing against monetization, but benchmarking LLMs requires attaching the methodology, even for a quick run, ballpark-figures type of thing. Otherwise I get irritated. It's the equivalent of citing a paper that has not been peer reviewed.
Edit: using a saturated bench adds no value though. I mean, he did criticize KLD measurements while at the same time justifying using a saturated bench (without due disclosure) for spotting quantization issues; it doesn't make sense to me.
It sounds like post-hoc rationalization; maybe I should check his Substack before running my mouth though.
https://kaitchup.substack.com/p/lessons-from-gguf-evaluations-ternary
"Chapter 6: Evaluating LLMs[under conception]"
That's sad.
Some info from https://kaitchup.substack.com/p/lessons-from-gguf-evaluations-ternary :
MMLU-Pro: 500 questions
GPQA Diamond: 50 questions
LiveCodeBench v6: 100 problems
Math-500: 100 questions
What we'd want to check:
What's zero-shot, n-shot, etc.
MMLU-Pro does answer extraction with regex (to discard the reasoning, for example); it has to be configured.
GPQA Diamond can be run with or without chain of thought, and after audit the "inherent error rate lower bound is 26.8%".
LiveCodeBench requires adding a new model manually; did he consult the errata before picking his tasks? We don't know.
Then there's Math-500, which is saturated and represents 13% of the test.
Eval is hard. He's a scholar; he could do better.
His words:
"Doing some research on LLM quantization?
If you work on quantization and submit papers to top-tier conferences, use real benchmarks, generating tokens (and NOT exploiting only logits), to evaluate quantization quality. Papers using perplexity are increasingly rejected, and that’s good. I saw way too many quantization methods preserving perplexity but producing completely useless models."
There he's conflating measuring quantization quality with evaluating accuracy on particular benchmarks. They're correlated, but different; it's not the same scope. He is somewhat wrong to dismiss KLD-style measurements as a tool for evaluating quantization quality specifically. A KLD measurement is like having the Mona Lisa and a copy, and evaluating the quality of the copy; it's not about how beautiful the painting is.
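To make the copy analogy concrete, here's a minimal sketch of the kind of per-position distribution comparison a KLD measurement does (toy probabilities for illustration, not real model output):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P||Q) in nats: how much the copy's token distribution Q
    deviates from the original distribution P at one position."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

base  = [0.70, 0.20, 0.10]  # baseline (e.g. BF16) probs for the top tokens
quant = [0.65, 0.25, 0.10]  # quantized model's probs for the same tokens
print(kl_divergence(base, quant))  # small positive number; 0 only for a perfect copy
```

The point is that this measures fidelity to the original distribution, independently of whether the original answers any benchmark question correctly.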
Speaking of measurement, datasets and benchmarks.
I'll soon be needing more prompts for my setup. Math, reasoning, code, various topics that an LLM might be trained on, and things that it might have more trouble with. Long tail, medium tail... So if anyone has any specific ideas regarding which prompts to include I'm welcoming them.
I currently only have 20, and they're not diverse enough to check how the quantization changes depending on the task and the domain, perhaps even across multiple languages.
I'm aiming for at least 100 prompts.
I'll need a lot of data before I feel confident in drawing out any statistics that will be used to classify the potential errors.
I don't want to generalize the impact of quantization either.
I suspect there's a lot of redundancy in the model, and that the more sensitive topics, where the model might not have a good leg to stand on in terms of being factual or just comprehending the task, will take a bigger hit.
Thanks for the link to that guy's blog; I had seen some of those charts shared out of context on reddit.
For starters, he is talking about:
Ternary weights (TQ1_0), where parameters (excluding some layers) become {-1, 0, +1}, still track the original model closely.
But if you look inside all of the UD-TQ1_0 quants, they do not actually use any ternary weights; they are simply a mixture of small quantization types like IQ1_M and such. I've called Unsloth out on this numerous times, and the creator of that ternary quantization type, compilade, has also brought it up with them. My understanding is that Unsloth just needed a "name slug" for their smallest model size and went with TQ1_0, as it has roughly the same BPW as the mixes they were making.
Anyway, interesting reads. I'd say providing PPL and KLD is better than providing nothing for relative quant comparison from the same provider. I do wish anyone providing data also provided the full, exact commands they were running, so that others could reproduce more easily (I try to do this, though my stuff is kinda spread out at this point, hah).
It's a fun, interesting challenge!
Yeah, you are right ubergarm, I saw one of those discussions with Unsloth asking them to rename it precisely to avoid confusion! And here it is!
Sorry all, I realize I have shared the post a bit hastily without really thinking critically. I feel like the dog bringing back something he found outside :D
The intel author of auto-round had some discussion with ik about 6 months ago here: https://github.com/ikawrakow/ik_llama.cpp/discussions/657#discussioncomment-13935697
In my limited comparisons, some of the low-BPW GGUF quants made by them didn't seem amazing at first glance: https://github.com/ikawrakow/ik_llama.cpp/discussions/657#discussioncomment-13935824
I haven't revisited lately though.
@ubergarm thanks. Hmm... I would be thrilled to give it a try on our specific 27B gem, but I'm back on 4G mobile-phone tethering as of today and will be until next Friday 😢
Oh @espen96 I have one for you: Reverse this string: '.DefaultCellStyle'
I love this vibe test prompt cause it's really minimalist and tricky!
Almost all small/medium LMs I tried fail at it. So far, the only ones I found reliably getting it correct are gpt-oss-20b (so I assume the 120B would too) and Falcon-H1R-7B (taking about 20k tokens of reasoning before coming back with its answer when I just tried it right now before posting this message!). Its ability to explore and come back is really impressive...
I didn't really manage to pin down all the factors leading many other models to fail (including this Qwen3.5-27B and the 35B-A3B). At first I was mainly thinking of the vocab ("maybe it just doesn't have the adequate tokens"), so the task was doomed to failure right from the start. But the more I think about it, the more I tell myself it actually involves several other implicit skills. I'm not asking anyone to dig into it beyond genuine curiosity and interest, but just like that, does anyone have a few intuitions about this?
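For reference, the ground truth the models are chasing is trivial to produce programmatically, which is part of what makes this such a minimal vibe test:

```python
# Ground truth for the vibe-test prompt: reverse the string character by character.
s = ".DefaultCellStyle"
print(s[::-1])  # elytSlleCtluafeD.
```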
Great weekend to all
EDIT: actually, the 27B succeeds some really rare times; we are not at a 100% failure rate
My factual benchmark is perfect for exploring this, as it uses a beam search and probabilities to check whether an LLM knows the factual answer, and how likely it is to output it as a direct continuation in the next few tokens.
Think: "The Nintendo character Mario made his first debut in 1981 in the arcade game titled", where the LLM must have a reasonable chance of outputting "Donkey Kong" or some other valid option.
Except it won't be a reasoning task, more of a direct completion. I bet it's like the strawberry "test": LLMs are just not set up for this.
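A rough sketch of that scoring idea (the names and probabilities here are hypothetical, purely to illustrate the shape of the check, not the actual benchmark code):

```python
# Hypothetical sketch: after a beam search over direct continuations of a
# factual prompt, check whether any beam starts with a known-valid answer
# and how much probability mass those beams carry.
valid_answers = ("Donkey Kong",)  # accepted completions for the Mario prompt

# (continuation, cumulative probability) pairs -- made-up values
beams = [("Donkey Kong", 0.62), ("Mario Bros", 0.05), ("Radar Scope", 0.01)]

hit_mass = sum(p for text, p in beams if text.startswith(valid_answers))
print(hit_mass)  # 0.62
```

If no beam carries a valid answer at all (as happened below with the reversal prompt), the hit mass is simply zero.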
Qwen3.5 35B at Q8 did not lead to a single valid path with a beam width of 100 tokens @owao
It also seems to struggle to spell it out... this is a really interesting one! I bet it is a tokenization issue, or vocab.
The model, while reasoning and being told to spell it out, gave me:
"
Here is the step-by-step reversal of the string '.DefaultCellStyle'.
Identify the characters in the original string:
. D e f a u l t C o l o r SSpell out the characters in reverse order:
S r o l o C t l u a f e D .Combine them into the final reversed string:
SrolCotluafeD."
Yeah, I think you broke it with that prompt. Interesting! I will see if I can turn any of this into something that works for the quantization benchmark.
:DD yeah, I think it might be bound to the vocab in the first place. I still need to look at https://huggingface.co/Qwen/Qwen3.5-27B/blob/main/vocab.json, but if that's not the limitation, it touches many aspects!
I have to be fair, it's not from me! I saw it on reddit like a year ago and also found it interesting! From time to time I like to test it against new models :) that's a tough one!
When looking at the vocab, interestingly, we find the token in Qwen3.5-27B: "DefaultCellStyle": 82607, but gpt-oss-20B and Falcon-H1R-7B don't have it. That's my initial contribution :)
https://huggingface.co/Qwen/Qwen3.5-27B/blob/main/tokenizer_config.json
https://huggingface.co/tiiuae/Falcon-H1R-7B/blob/main/tokenizer.json
https://huggingface.co/openai/gpt-oss-20b/blob/main/tokenizer.json
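The membership check behind that observation is simple; here's a sketch with a toy stand-in vocab (only the "DefaultCellStyle": 82607 entry reflects the real Qwen3.5-27B vocab noted above; the other entries are made up, and a real check would json.load() the vocab.json downloaded from the repo):

```python
import json

# Toy stand-in for a model's vocab.json ({token string: id}).
vocab = json.loads('{"Default": 1, "Cell": 2, "Style": 3, "DefaultCellStyle": 82607}')

# If the whole string is a single token, "reverse this string" becomes a
# one-token-in, many-characters-out problem for the model.
print("DefaultCellStyle" in vocab)  # True
```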
Before I start on my dataset properly, some preliminary results, since I now have more metrics and some semantic data.
These prompts are generally quite tricky, it seems, and the results show that the model quickly starts to hallucinate compared to baseline.
Q8 seems to be very close to baseline, but then it starts to get worse.
Most errors are not concerning, but it appears that quantization is not just amplifying the existing errors; it is flattening out the distribution and making confabulations more likely.
I would note these are cold starts, so extrapolating to the rest of the conversation won't work as well.
I suspect that multi-turn will change the error profile, flattening it as it builds upon previous context.
So if the first 1-2 turns are where an LLM is most likely to run into clear cascade events, I suspect future turns fall into the soft degradation of tier 1 plus context degradation; the errors might be less obvious and impact soft reasoning skills.
OK, so I ran the 27B with the Q8_0 as a baseline. This is over 88 prompts.
If we look first at [chart] and then compare it to what I am seeing here, one thing is clear: Q2 is not fine. That quant might be passing benchmarks to some extent, but it is out picking berries.
Q3 as well; it is not healthy and is way off compared to Q4 and higher.
[These charts] stand out to me as very clear examples.
Here it is clearly struggling with numbers where it should not, and with basic chemistry: "6CO2 + 6H2O + light energy → C6H12O6 + 6O2".
Another thing to note is the "Max Absolute Divergence": the bars that are crossed out have at least one token in the test where it diverged 100%. Meaning, we went from a token the baseline chose with 100% probability to the quant selecting another one with 100% confidence.
For most of the Q4 ones it is 1-2. For Q3 it is 6, not bad. For Q2 it is 104.
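A minimal sketch of how such 100% flips could be counted (the interface and toy numbers here are mine for illustration, not the actual tooling):

```python
def count_full_flips(baseline, quant, threshold=0.99):
    """Count positions where the baseline was (near-)certain of one token
    and the quant is (near-)certain of a *different* token."""
    flips = 0
    for (b_tok, b_p), (q_tok, q_p) in zip(baseline, quant):
        if b_tok != q_tok and b_p >= threshold and q_p >= threshold:
            flips += 1
    return flips

# Toy per-position (top token, probability) pairs.
baseline = [("6", 1.00), ("CO", 0.79), ("2", 0.99)]
quant    = [("the", 1.00), ("CO", 0.74), ("2", 0.99)]
print(count_full_flips(baseline, quant))  # 1
```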
Most of the risk is early on; then the model starts to settle. It drops, and it more or less stays at or returns quickly to the baseline.
I'll have to prep less generic stuff next, to test agentic performance, long context, and reasoning. Then it can be packaged and shipped out once I run my semantic tests on it all as well, to give hallucination-rate predictions.
I'll also grab some other, non-unsloth, quants for a comparison as well. I'll probably focus on Q4 as that's where the competition is. Q5 is a meaningful upgrade however, as seen above. I'll probably grab a few Q5-6 quants as well, and also any Q2-3 special quants that show up, just to demonstrate the massive difference.
Also, I do apologize for using this as a place to share the preliminary results. Hope nobody is too annoyed.
Interesting. But for your next benchs you can also include Ubergarm one! hahaha
Question: that's disturbing to see even Q4_K_XL sometimes flipped (if its the correct word) the "CO" token in "6CO2 + 6H2O + light energy —> C6H12O6 + 6O2", on such a simple and common equation. Do you have the runs history stored somewhere to see what it flipped to? I can't imagine it got it wrong but at the same time I can't guess what it could have replaced it by.
Also, I do apologize for using this as a place to share the preliminary results. Hope nobody is too annoyed.
It's plenty of graph, more than what I can process personally! But maybe it can talk to others!
Interesting. But for your next benchs you can also include Ubergarm one! hahaha
Question: that's disturbing to see even Q4_K_XL sometimes flipped (if its the correct word) the "CO" token in "6CO2 + 6H2O + light energy —> C6H12O6 + 6O2", on such a simple and common equation. Do you have the runs history stored somewhere to see what it flipped to? I can't imagine it got it wrong but at the same time I can't guess what it could have replaced it by.
Luckily, it seems to have switched between two token variations of CO, so this time it's not the most meaningful potential swap. However, the baseline Q8 was 78.7% certain about how it preferred to express "CO".
At Q4 it was still preferring the same token, with minor variation.
The "6", however, started out as a 100% probability; it was going to write "6" all the way down to Q3. At Q2, however, it is so shaken up that its second-highest preference was "the" at 8.8%, and then what looks like LaTeX at 6.8%.
It could have said "The overall equation for photosynthesis is: The", which is a nonsense continuation.
And yea, it is a lot of graphs. I certainly do not plan on exposing this much stuff in the completed setup. That's how it goes when you gather preliminary data. You look at it as many different ways as you can and keep the stuff that is useful.
I am certainly glad I am doing this myself. To me it's a clearer picture than KLD or PPL, and a different metric than conventional benchmarks.
two token variations of CO
Reassuring! Thanks!
Added the Ubergarm-smol-IQ4_NL quant.
It lands in between Unsloth's Q4_K_M and their Q4_K_S, closer to the K_S.
More disruptions marked as "minor" than K_M, but it wins over K_S. It about matches K_S in "moderate", which is probably what we might consider swings that make the model "feel a bit off".
It matches K_M in severe errors. All in all, in terms of errors, on this dataset, they're about even. I would need even more data, as it all changes per prompt.
I'll try to come up with a simpler visualization on my end for this type of quantization attention drift stuff, something less fancy (and data dense). More like snippets of generations side by side with first token drift and average divergence like you did. I think it's enough for most people to get the idea.
I'll probably settle on 3 or 4 domains (like english, code, language etc) and a short generation so I can test more quants.
I'll try to come up with a simpler visualization on my end for this type of quantization attention drift stuff, something less fancy (and data dense). More like snippets of generations side by side with first token drift and average divergence like you did. I think it's enough for most people to get the idea.
I'll probably settle on 3 or 4 domains (like english, code, language etc) and a short generation so I can test more quants.
Good idea. I would say not to pay attention to the top-token flipping. Disruptions is perhaps, IMO, the most important raw metric: it tells you something about how likely the model is to diverge, and by how much.
Second is probably the disruption confidence. It makes it very clear that the model is not just getting worse where it was already struggling; it's losing confidence in areas where it used to have it.
Thanks for running more numbers, and I see a lot of chatter on reddit too about various updated recipes. It's a lot to keep track of right now!
Added the Ubergarm-smol-IQ4_NL quant.
Great, thanks for running it! Amusingly this quant is designed for Vulkan backend experiment, as I was trying to get some more options for AMD hardware people.
I haven't released ik_llama.cpp specific quants for some of these new Qwen3.5s as I'm holding out a bit to see if there is any consensus on keeping some tensors even higher bpw and not downcasting them to f16 etc.
So I'm looking! Thanks!
Something like that: https://huggingface.co/spaces/cmh/Qwen3.5-9B-GGUF-quant-drift
The script: https://github.com/cmhamiche/token_drift
47 quants from bartowski, lmstudio and unsloth. I've yet to run llama-perplexity on them for a kld ranking.
Something like that: https://huggingface.co/spaces/cmh/Qwen3.5-9B-GGUF-quant-drift
The script: https://github.com/cmhamiche/token_drift
47 quants from bartowski, lmstudio and unsloth. I've yet to run llama-perplexity on them for a kld ranking.
Oh interesting, so you just let them run and checked how often they were equal, or how much of the output was.
I forced them to pick the same tokens and checked the probabilities... which is slow because we do not have echo.
What is interesting is that it worked well on math and code, or worked at all; on pure language tasks... nope, too much drift, instantly.
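The teacher-forced comparison can be sketched roughly like this (the model_step interface is hypothetical; a real run queries the inference engine one token at a time, which is exactly why the lack of echo makes it slow):

```python
# Hypothetical sketch: feed the quantized model the baseline's token
# sequence one step at a time and record the probability it assigns
# to each forced token.
def forced_probs(model_step, baseline_tokens):
    # model_step(context) -> {token: probability} for the next position
    # (hypothetical interface standing in for a real inference call)
    context, probs = [], []
    for tok in baseline_tokens:
        dist = model_step(context)
        probs.append(dist.get(tok, 0.0))
        context.append(tok)
    return probs

# Toy stand-in model that always prefers "a".
toy = lambda ctx: {"a": 0.9, "b": 0.1}
print(forced_probs(toy, ["a", "b", "a"]))  # [0.9, 0.1, 0.9]
```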
That is handy! I think it shows just how easy benchmarks can be for LLMs, and why my numbers start to scream as we go from Q4 to Q3 and Q2.
I realized how borked it was on mobile, it should be responsive now.
Anyway, I went for the most naive, easy to understand visualization.
And yeah, between domains there's a large difference. I went for French (which shares a lot with English in terms of vocab), but if I were to try Korean or Turkish, it would be so much worse.
Unfortunately it wouldn't really show; it would still point at a drift at the 4th token (unless it's a larger model, I guess).
Lovely! I took the liberty of measuring those prompts.
Let's zoom in on the differences. I did not download any extra quants; I am still using the Q8 as a baseline for the 27B. Obviously running a larger dense model is a headache... I might switch to the 9B as well for further dense-model tests.
Man... I really wish I could measure DeepSeek R1 and those crazy quants from last year....
--Code--
--math--
--English--
--French--
Wait @cmh, did you only do one run per prompt per quant? Your UI is super sexy and well thought out; I didn't feel overwhelmed :)
Also did you use the base model instead of the instruct one?
Sad we still aren't, and maybe will never be able to get our hands on the 27B base model :(
I'd like to be optimistic and think that the quantization would less hurt the 27B than the 9B one! Maybe we can still try to run on the instruct one, even if it is not suited for continuation task, it could still show something...
Thanks for the 2 codebases by the way! The kld-sweep will be handy too! Great job!
I'd like to be optimistic and think that the quantization would less hurt the 27B than the 9B one! Maybe we can still try to run on the instruct one, even if it is not suited for continuation task, it could still show something...
I can't speak for the 9B yet, but I can tell you the 35B seems to be taking more damage than the 27B.
You want to measure the base model? It could be a good comparison... although it is not that much of a base model; it has a fair bit of instruction training.
Yeah, actually I realize doing it on the instruct model instead of the base one shouldn't really be an issue... Maybe to make it clean we could just prepend the user prompt with something like "Your task is to continue the given text", or even set it as the system prompt... But maybe just running it as you did, cmh, is fine... I don't really know.
can't speak for the 9B yet
saw a good post on r/LocalLLaMA with some data covering some Qwen3.5-9B models: https://www.reddit.com/r/LocalLLaMA/comments/1rr72lr/qwen359b_quantization_comparison/
OK I think it's time to touch some grass 😄
It's not GGUF related, and only for nvidia and mlx, but I just saw this. It seems worth keeping an eye on it https://z-lab.ai/projects/paroquant/
sizes up to 9B have already been done
It's way, way, way above my paygrade but I find it funny that both papers arrived at the same conclusion about global rotation (meant for block scale issues but also finding outliers) when they are trying to tackle different problems: one says it is bad for mxfp4 accuracy, yours says it's expensive at inference.
https://arxiv.org/html/2511.04214v1
Indeed, GGUF and 4-bit bnb use block scaling (blocks and super-blocks for GGUF); it's not rotation-based, even though all these issues can be summed up as "outliers ruin quantization, here's my hack for dealing with it" at the end of the day. Those papers are about introducing some granularity to the rotations; GGUF is like, nope, "imatrix is good enough".
can't speak for the 9B yet
saw a good post on r/LocalLLaMA with some data covering some Qwen3.5-9B models: https://www.reddit.com/r/LocalLLaMA/comments/1rr72lr/qwen359b_quantization_comparison/
If I were to make an educated guess based on that, I'd say it's doing as expected.
Q4 falls in the bottom half of the corner, with Q5 on the flat. I would say it looks to be grouping worse than we would like at Q4, but it is doing as well as we might expect from a 9B model.
@cmh Thanks for sharing the wikitext2_test.txt file in the reddit thread for 27B. I took the liberty to upload a couple of IK quants, along with the llama-perplexity results using the same 72 chunks wikitext2:
You named your models after their bpw, it's so much better, I love it.
I wish I had kept the logs, but just by looking at the dumbbell graph, "4.915bpw imatrix" (0.0646 at 15.37 GiB) would be lower than anything. That is quite impressive. The impact of the imatrix on 4.151bpw is also quite impressive: it went from 0.1084 to 0.0836.
edit: If I had more VRAM, I'd run your 4.915bpw instead of AesSedai's Qwen3.5-122B-A10B-IQ4_XS
Oh and thank you for the shoutout.
Ran them in my bench @sokann
In terms of what I call "Predicted Fractures", Sokann-4.165bpw is a bit worse than Unsloth-UD-Q3_K_XL: 20 predicted fracture events in the first prompt vs 17.1.
Predicted Fractures use a Fermi Estimate to separate harmless noise from critical divergences. Instead of adding points for errors, it starts with the absolute ceiling of raw events and filters down based on structural and semantic probabilities.
What's more alarming is the number of Catastrophic Breaks. These are events where the tokens didn't just have a certain probability of diverging, but flipped 100%.
Which means the baseline choice was completely ignored.
In this benchmark we have 88 prompts so far. The Q4 quants from Unsloth and Ubergarm have scored 1-2 of these events in total, Unsloth's Q3 XL had 6 such events, and Sokann-4.165bpw had 17.
Sokann-4.915bpw is doing worse; however, since it has to be run on IK's fork, the data is not comparable. With the baseline on mainline llama.cpp, implementation differences make the comparison invalid.
How invalid? Well, it has 23 break events and scored worse than UD-Q3_K_XL, with 18.5 PF vs 17.1 PF for the Q3 XL.
KLD and "catastrophic breaks" are measuring different things I get it but, just curious, how do you calculate the "semantic probabilities" since it's pretty vague, an embedding model ? I'm not trying to reconciliate both benchmarks, i'm just exploring.
KLD and "catastrophic breaks" are measuring different things I get it but, just curious, how do you calculate the "semantic probabilities" since it's pretty vague, an embedding model ? I'm not trying to reconciliate both benchmarks, i'm just exploring.
Currently, Qwen3.5 35B at Q8_0 with full reasoning takes all the baseline responses, one by one. It then goes over everything and breaks it down:
- word: The exact string from the text.
- word_score (1-10): The critical aspect. How vital is this word to the overall factual or structural integrity of the sentence?
- 1-3: Low importance. Formatting, common punctuation, filler words, generic articles (the, a, an).
- 4-6: Medium importance. Common verbs, adjectives, connecting concepts, language-specific syntax (brackets, commas).
- 7-8: High importance. Key nouns, specific actions, important qualifiers, significant mathematical operators or variables.
- 9-10: Critical importance. Vital entities, exact numbers, dates, unique names, core subjects, exact equation constants. If an LLM hallucinated here, the sentence's meaning would be destroyed or factually incorrect.
- word_type: The grammatical or semantic role. Choose EXACTLY ONE of the following:
- entity: Specific names, places, organizations.
- action: Verbs representing actions.
- qualifier: Adjectives or adverbs modifying meaning significantly.
- number: Exact numeric values.
- date: Specific times or periods.
- math: Mathematical symbols, operators, variables, constants, equations.
- code: Programming language statements, syntax, function definitions, keywords.
- formatting: Spaces, newlines, indents.
- punctuation: Commas, periods, quotes (outside of code/math meaning).
- content: General meaningful words not fitting above.
- function: Grammar words (the, a, an, and, or) and stopwords.
this goes into determining the value of each part of the text.
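As a sketch of how such a rubric could feed into a score (hypothetical field names and toy numbers, not the actual implementation), weighting each word's divergence by its word_score so that noise on filler words is discounted and flips on critical words dominate:

```python
# Hypothetical sketch: combine per-word divergence with the semantic rubric.
def weighted_divergence(words):
    # words: list of dicts with "divergence" in [0, 1] and "word_score" in 1..10
    total_weight = sum(w["word_score"] for w in words)
    return sum(w["divergence"] * w["word_score"] for w in words) / total_weight

words = [
    {"word": "the",  "word_score": 1,  "divergence": 0.30},  # filler: mostly ignored
    {"word": "1981", "word_score": 10, "divergence": 0.05},  # critical: dominates
]
print(round(weighted_divergence(words), 3))  # 0.073
```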
If we do NOT have this, the data we have from other runs is used as an estimate: we can roughly tell how much of a text is of which kind (numbers, formatting, punctuation...), and using data on how well the auto-categorization matches the actual categories, we get a statistics-based fallback.
There are also weights telling us how important each category is presumed to be.
Anyway, it is an LLM's analysis telling us the semantic category and importance of each part of the text. It is based on something I did early last year, where I ran perplexity over entire Wikipedia articles with some 8-12 LLMs to learn more about perplexity vs hallucinations on various topics.
It worked extremely well then, so this is largely a reuse of that; it allowed me to ignore noise on tokens with little to no importance.
Obviously not ideal, but it's fine for now. Currently working on generating data on 425 prompts over 17 categories; that way we can tell how various topics are affected by quantization.
This is still based on the first 88 prompts, so it is skewed and not complete.

The full set has 25 prompts for each category, plus room for long-context, agentic, etc. use cases.
Qwen 3.5 35B, Q8_0 with full reasoning, okay, okay.
Very interesting, I'm that close to feel equipped to understand what's going on.
Would this repo help in any way? https://github.com/cmhamiche/kld-sweep-dataset
Or are your prompts hand-crafted for a specific reason?
Qwen 3.5 35B, Q8_0 with full reasoning, okay, okay.
Very interesting, I'm that close to feel equipped to understand what's going on.
Would this repo help in any way? https://github.com/cmhamiche/kld-sweep-dataset
Or are your prompts hand-crafted for a specific reason?
They are picked to represent both the things LLMs are known to, and trained to, be good at, as well as the "gradient" down to things they are bad at, like regional topics not covered as much in English, various languages, etc.
Your tool there looks pretty good!
So my prompts cover this:
Baseline (Highly Robust in most models)
Science
Technology
Computing_Theory
Programming_Applied
Mathematics_Theory
Mathematics_Applied
English (Grammar/Linguistics)
Linguistic Spectrum
Languages_Major (French, Spanish, Mandarin, etc.)
Languages_Minor (Navajo, Basque, Swahili, etc. Less likely to be well covered in an LLM)
Cultural & Historical Degradation Scale
History_Western & Culture_Western
History_Global & Culture_Global (Anime, Video games etc.)
History_Regional & Culture_Regional
History_Minor & Culture_Minor
Western: US / European mainstream (Highest resilience)
Global: Non-Western origins, but massively exported and consumed by the West (High resilience, Anime, K-pop etc.)
Regional: Huge in their native countries (e.g. Japanese idols, Nollywood, regional African TV), but mostly ignored by the West.
That is why I have the dataset I have right now: it tries to make sure it checks for stuff we know is more likely to survive, based on all those benchmarks, as well as things we might neglect to test for.
You've got major Western stuff covered, as well as code and all that, but it lacks the rest: pop culture, regional topics, stuff not discussed in the West.
My dataset is quite big atm, but I am aiming to generate smaller sets once I have all the statistics to allow for better extrapolation.
I'll say, other than the categories and "catastrophic breaks". I totally expect to get to the point of using KLD as a way to estimate the other stuff.
Interesting.. How much work is required to be able to run the same benchmark fairly with both mainline and ik?
At the same time? Unsure. Since the implementation is different, we can expect them to have different token preferences, or different logprobs.
I am not certain if having a per engine baseline would be enough. Maybe?
I'm waiting to calculate the margin of error until I have this run done. There are a few methods I might choose from, and I need all the data before I can say for sure.
Besides that, if you run all of it on IK then all the results are compatible.
I will say, for further testing I'd probably switch to the MoE or a smaller model.
Currently, even without having more traditional quants from Bartowski and others, the full 425-prompt dataset puts me at 6375 total tests. And since there's no echo support, there's a lot of added latency when it has to stop and roll back a token that doesn't match.
@Sokann if you make some 9B quants or 4B, I'd gladly test those. It would be a better test as I can run it all against the BF16 baseline then.
I'll probably be running a few smaller model families before extracting a small and a medium sized prompt set anyways, so one of those two would be prime candidates.
Admiring your perseverance, guys! espen96, you're a machine 😅
Currently trying https://huggingface.co/z-lab/Qwen3.5-27B-PARO. I actually didn't think I would be able to run it on my 24GB of VRAM, but it fits! I guess I totally missed --max_num_seqs last time I tried running a model through vLLM. I noticed the default value was really high (256 for my 3090!), and setting it to 1 or 2 for my regular use cut the memory requirements by a fair amount. That said, other than some vibe tests and benchmarks, I guess I can't really add value to what you guys are doing here :/ But I'd be glad to run a benchmark if you're interested in comparing! So far I couldn't spot any difference
Data is data, and data is gold. Would love to see how it compares. So please do share!
I might have to migrate off JS at some point...
I did the math, and this run is generating 1.632 million tokens... so adding more models is probably simply beyond what it can handle... Oops... I should have gone with Python, or literally anything else.
Running :)
==================================================
Evaluating model: z-lab/Qwen3.5-27B-PARO
==================================================
Resuming from existing results (17 runs already completed)
Running 3 evaluation run(s)...
Prompt: Reverse this string: '.DefaultCellStyle'
Ground truth: elytSlleCtluafeD.
Results will be saved to results/z-lab_Qwen3.5-27B-PARO.json
Run 18: ✓ (extracted: elytSlleCtluafeD.)
Run 19: ✓ (extracted: elytSlleCtluafeD.)
Run 20: ✗ (extracted: None)
==================================================
Results: 6/20 correct (30.0%)
I'll do 50 runs for z-lab/Qwen3.5-27B-PARO, bartowski/Qwen3.5-27B-IQ4_NL, unsloth/Qwen3.5-27B-Q4_K_XL and ubergarm/Qwen3.5-27B-smol-IQ4_NL to start with.
But I'll need to let it run overnight, because I want to start right away with reasoning mode, and this takes about 2-3 min per run! Though I'm wondering how I could better differentiate between fewer errors overall (leading to less error compounding) and better error recovery. It could be interesting to do this without reasoning, asking them to answer immediately instead; that should isolate raw errors. I'll do that in a second pass. Maybe doing 50 additional runs would be great too.
I can't believe I'm about to produce gold at home :D
Since the implementation is different, we can expect them to have different token preferences, or different logprobs.
Huh.. while the implementation is indeed different, I kinda just assume that both mainline and ik have their own correct implementations, such that they would give the same logprobs.
After some checking, the situation is more complicated than I thought.
In ik, I was aware that there was a brief <1 day period where a short prompt would result in gibberish with Qwen3.5. This was soon mitigated by https://github.com/ikawrakow/ik_llama.cpp/pull/1339. Very limited impact.
In mainline, I just noticed that there was a 7+ days period in between the merging of https://github.com/ggml-org/llama.cpp/pull/17795 and https://github.com/ggml-org/llama.cpp/pull/20463 where all the models were bugged.
Any benchmark done with an impacted mainline build must be redone. I am recreating my llama-perplexity test from scratch, since the KLD logits were generated by an impacted mainline build.
Huh.. while the implementation is indeed different, I kinda just assume that both mainline and ik have their own correct implementations, such that they would give the same logprobs.
Both may be correct, but they're not compatible. The same way perplexity values from llama.cpp and vllm are not directly comparable. The two engines have diverged quite a bit, and thus they can give results that are incompatible with each other, especially when the model implementation is not the same either.
A correct and valid implementation does not have to mean they give a result that is compatible across engines. Same same but different.
Quote from llama.cpp's own perplexity tool
"llama.cpp numbers are not directly comparable to those of other projects because the exact values depend strongly on the implementation details."
Regarding the issue with mainline, I'm not seeing anything obvious on my end?
Not saying it didn't affect anything, but the data looks clean enough to gain insight. Nothing stands out as unlikely.
But considering the mess right now, after this run is done, I'll probably switch to something older. Qwen 3, Llama 3.x, Ministral, Gemma 3. Those are as stable as it gets.
Feel free to suggest other smaller models that are likely to have a fairly unique fingerprint, to see cross model trends.
Then I'm calling it quits on Qwen 3.5 27B at the very least, until the dust has settled. More than enough time to figure out the best prompts for the shorter comparison runs.
fairly unique fingerprint
If you need long CoTs, you can try Nanbeige/Nanbeige4.1-3B (LlamaForCausalLM) or POLARIS-Project/Polaris-4B-Preview (Qwen3ForCausalLM), those are pretty unique!
Nanbeige already produces quite long CoTs, and I think Polaris's are even longer!
I'm finally getting around to this again, and trying to find the original research suggesting to keep ssm_(alpha|beta) unquantized at bf16 or upcast to f32 (for speed)? I know there are some reddit posts suggesting these tensors are sensitive, and they are very small, but I can't find a reference with any benchmark showing the difference.
I did an experiment with 3x otherwise identical ik_llama.cpp quants and show the speed and quality appear quite similar for all three: https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF/discussions/7#69b8404f18a5e8feffd9f5c8
I will likely upload the f32 just out of pure superstition as it fits nicely in 24GB VRAM with 128k context...
if you want to get really wild testing long context performance with any models running on CUDA, change this hard coded value to 0, recompile, and use -ctk f16 -ctv f16 and see if you can notice any difference:
https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cuda/fattn-common.cuh#L13-L19
That offset can be tuned at runtime in ik_llama.cpp: -cuda fa-offset=0 as per: https://github.com/ikawrakow/ik_llama.cpp/pull/1198
I don't think using -ctk f32 -ctv f32 would help, given I'm pretty sure the accumulators used during flash attention are not full f32...
neat! I will be running long context KV test soon enough. it seems Mistral Small 4 is also coming soon, based on the same architecture used for Mistral Large 3. Good timing for some LC tests.
I don't presume there is much to be gained by testing 256k to start? so I'll probably run tests for 8, 16, 32, 64, 98, and 128K that should give us a trendline.
Then I think I will focus on BF16, Q8_0 and Q4_K_M, and add more if I need to..
also:
I have to make sure everything ran just fine, and toss bad prompts, but we can at least see extremely clearly how Qwen 3.5 27B is maintaining strong code completion and math performance. Everything else, which might impact both what it is able to recall, and possibly its ability to reason and understand the user... well, it is dropping much faster.

I'm lacking a labeled chart, but the standout is Q2_XL.
I will hold off on the final data,
but one thing I am noticing: the gap between quants within a tier (q2, q3, etc.) is smaller than the gap to the next tier.
You gain a lot by jumping from q3 to q4, and you gain a fair bit by jumping from q4 to q5, and so on. even the smallest q5 is quite a bit better than the largest q4.
Your first chart is really insightful! Great effort again espen!!
For my own humble contribution, I was able to extend the samples to 100 runs instead of 50! I'm just mad I set it up in a hurry yesterday and didn't think about storing the # generated tokens per run :/ It could have been interesting to compare and would have been free additional data, but anyway...!
Here are the universal sampling params used (the official thinking [general] ones, the most creative among the 4 suggested sets):
"temperature": 1.0,
"top_p": 0.95,
"top_k": 20,
"min_p": 0.0,
"presence_penalty": 1.5,
"frequency_penalty": 0.0,
"repeat_penalty": 1.0,
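For anyone wanting to replicate this kind of run, here is a minimal sketch of firing one request at llama-server's OpenAI-compatible endpoint with those params. The URL, port, and model id are placeholders, not taken from my setup, and the actual vibe-coded script differs:

```python
# Minimal sketch: one request against llama-server's OpenAI-compatible
# endpoint, using the sampling params from the post above.
import json
import urllib.request

payload = {
    "model": "Qwen3.5-27B",  # hypothetical model id
    "messages": [{"role": "user",
                  "content": "Reverse this string: '.DefaultCellStyle'"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "frequency_penalty": 0.0,
    "repeat_penalty": 1.0,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # placeholder host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))  # needs a running server
# answer = resp["choices"][0]["message"]["content"]
```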
For recall, the test was simply to correctly answer this user prompt (no system prompt): Reverse this string: '.DefaultCellStyle'
Accepted answers were simply those containing the string elytSlleCtluafeD. (so any 'elytSlleCtluafeD.' was indeed accepted too)
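The acceptance rule is simple enough to sketch in a couple of lines (a sketch of the rule as described, not the actual script):

```python
# Acceptance rule as described: any response containing the reversed
# string counts as correct, so quoted variants pass too.
TARGET = ".DefaultCellStyle"[::-1]  # -> "elytSlleCtluafeD."

def is_correct(response: str) -> bool:
    return TARGET in response
```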
Here are the results:
| Model | Size | Score |
|---|---|---|
| z-lab/Qwen3.5-27B-PARO | 18.8 GB | 42/100 |
| bartowski/Qwen3.5-27B-IQ4_NL | 15.9 GB | 29/100 |
| ubergarm/Qwen3.5-27B-smol-IQ4_NL | 16.6 GB | 27/100 |
| unsloth/Qwen3.5-27B-Q4_K_XL | 17.6 GB | 19/100 |
Now I realize it would be fairer to compare the PARO one to quants of the same size... Nevertheless, ouch for the UD-Q4_XL on this test!
For the setup:
- The PARO one was running the patched vllm server from the paroquant repo
- llama-server pulled and built from main as of yesterday (ah I'm sorry I can't tell what commit it was)
And if some wanna check or reproduce:
- The simple python script vibe coded with the PARO one 😄 https://paste.sh/SRj6rVuM#rbfPExsOxZ6oeY6f1cO-dw5X it's trash with redundant info logging, no error handling etc. but it did the job... Please don't look at me.
- The aggregated runs:
- z-lab/Qwen3.5-27B-PARO https://paste.sh/2GnZnyBg#G6jkivjuIqS9EiC-JOyp0k_W
- unsloth/Qwen3.5-27B-Q4_K_XL https://paste.sh/uUs4SPsg#a6uKtKKh4knbK4ie__EUnV7x
- ubergarm/Qwen3.5-27B-smol-IQ4_NL https://paste.sh/4mMN1gE1#02OBZ0OLlucgVR_KLJ3hPhOH
- bartowski/Qwen3.5-27B-IQ4_NL https://paste.sh/bTyattSf#4Df8642-3u5bneAbkl9CCFA5
Oh and I ran this with llama-swap (I love it!), here is the relevant part of my config file with the servers' launch commands: https://paste.sh/Iqp83qf2#gT1nE2-xPbWUQcMpCDWSrGLV
As smaller models are more sensitive to quantization, it could be a good idea to run this on the 9B, this time picking similarly sized quants, and btw storing the response lengths!
could be a good idea to run this on the 9B
Unfortunately I can't run the same test on the 27B with "enable_thinking": false, nor on the 9B (even with "enable_thinking": true, and despite them sharing the exact same vocab), because they just can't provide any accepted answer at all :/ So in despair I tried the 4B in thinking mode as well, but as expected, it wasn't even close to any valid answer...
ok, I know I was going to update the earlier post, but.
this spread on the 35B is quite something.
the Q6_XL is practically on top of the Q8, and the @AesSedai quant is right on its heels.
Clearly, if you're looking at dropping to q8 at all, you might as well seriously consider dropping to q6 or q5 for the performance gains. You may have already lost enough by going to q8 that the further jump to a good q6, or a good q5, costs almost nothing. Especially if one wants to use a model like this as an everyday flash model, or a subagent alongside the 27B or 122B. It will have lost accuracy, and it might flip tokens etc., but if it has data to work with? I mean...
Those are some very useful numbers, @espen96 , and an exceptional visualization too!
I wonder if it would be too much to ask to see how AesSedai's IQ4_XS and Q4_K_M fare in comparison, as those are the biggest quants that can fit in 24GB VRAM completely while leaving some space for context, and they used to outperform Unsloth's quants (not sure how things stand now; there was a flurry of updates from Unsloth, some better documented than others).
And again, thanks for posting ppl/KLD charts for your quants, @AesSedai !
Can do @Maximm69, but I might suggest offloading experts to RAM for this one. I'm getting the feeling that it's not extremely sharp to begin with. The Q4 XS of the 27B runs circles around it for mermaid diagrams and charts, for instance. The data tells me the Q5 isn't too far behind, but it keeps feeling a bit... off?
I'll have the results later today, but unless you really need the speed, I might recommend offloading experts and keeping to Q5. PS: @AesSedai's quant beat out UD Q5_K_XL, but it is a bit bigger.
If they're paying attention to this thread, might I suggest a fused Q6 option? It could deliver near-Q8 quality while out-speeding UD Q6_K_XL
edit:
I would stay away from Q4
@espen96 I am paying attention to this thread :) Thanks for doing this testing!
I chose Q5_K_M / Q4_K_M / IQ4_XS / IQ3_S because I was looking for a reasonable number of size breakpoints in the mid-bpw range (~3-6bpw). The Q6-ish level gets into "might as well use Q8" territory IMO because my MoE quants follow this pattern:
- Pick a quantization level for ffn_(up|gate)
- Pick a +1 quantization level for ffn_down
- Pick a high-quality default ftype
So for Q5_K_M that looks like: Q5 for the up and gate, Q6 for the down, and Q8 for the default ftype.
What would a Q6_K_M look like? Probably Q6 for the up and gate, Q6 or Q8 for the down, and Q8 for the default ftype. At that point you're close enough to just use the Q8 IMO, unless I did just Q6 across the board for the up/gate/down maybe. I'm not ruling it out 100% if someone really wants it.
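That three-step pattern is easy to express as a tensor-override table. A toy sketch (the ladder ordering and the mapping are my paraphrase of the pattern above, not AesSedai's actual recipe file; tensor names follow common GGUF conventions):

```python
# Toy expression of the recipe pattern: base level for ffn_(up|gate),
# one level up for ffn_down, high-quality default ftype for the rest.
QUANT_LADDER = ["IQ3_S", "IQ4_XS", "Q4_K", "Q5_K", "Q6_K", "Q8_0"]

def moe_recipe(base: str) -> dict:
    step_up = QUANT_LADDER[QUANT_LADDER.index(base) + 1]
    return {
        "ffn_up": base,
        "ffn_gate": base,
        "ffn_down": step_up,   # +1 quantization level
        "default": "Q8_0",     # high-quality default ftype
    }
```

So `moe_recipe("Q5_K")` reproduces the Q5_K_M-style mix described in the post: Q5 up/gate, Q6 down, Q8 default.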
If they're paying attention to this thread, might I suggest a fused Q6 option?
Interestingly, some early testing by ik suggests that ik_llama.cpp's q6_0 might be a good type given its ~6.5BPW as well as 32 block size (most of the i quant and k quant types are 256 block size). Details on that experiment here: https://github.com/ikawrakow/ik_llama.cpp/issues/1471#issuecomment-4097526398
I'll play around with it some.
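The ~6.5 BPW figure checks out from the block layout, assuming q6_0 stores each 32-weight block as 6-bit values plus one fp16 scale (layout assumed from the legacy 32-block-size pattern, not read from the ik source):

```python
# Back-of-envelope bits-per-weight for an assumed q6_0 block layout:
# 32 weights * 6 bits each, plus one 16-bit (fp16) scale per block.
block_size = 32
bits_per_block = block_size * 6 + 16
bpw = bits_per_block / block_size
print(bpw)  # 6.5
```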
Oh, I didn't even realize mainline was missing a q6_0. Interesting finding, this model keeps bringing new insight eh? It does seem like one might need more quant formats to really optimize for each model. Quantizing with the standard recipe and an imatrix might not be good enough at all going forward, and the existing options might be leaving quite a bit of optimization on the table.
if 6_0 is looking good, it might be worth looking into the other legacy quants? or is 6_0 special? You got a lot more insight there than me
Quantizing with the standard recipe and an imatrix might not be good enough at all going forward, and the existing options might be leaving quite a bit of optimization on the table.
We've already seen many of the quantizers move away from the original standard recipe mixes, either by creating their own branch with recipes stored in the llama-quantize code à la unsloth and bartowski, or, like myself and AesSedai, by using custom recipes with ik's llama-quantize --custom-q or mainline's llama-quantize --tensor-type.
So at this point I don't think we're leaving too much on the table, especially taking into consideration so many different backend hardware and kernel implementations makes it hard to optimize without trade-offs.
if 6_0 is looking good, it might be worth looking into the other legacy quants? or is 6_0 special?
So at the moment q6_0 only exists on ik_llama.cpp, and doesn't exist on mainline. All the legacy quants are 32 block size, similar to iq4_nl. Their main advantages are that they've been around longer and are fairly simple, so they tend to be faster to compute for PP, especially on Vulkan backends etc. I've experimented some with q4_1 (5 bpw) as well, which exists on mainline too.
I do believe mainline has q5_0 (5.5 bpw) and q5_1 (6.0 bpw), which could be interesting, but I've not tried them so much. Might be good for mac/vulkan folks.
I uploaded another ik_llama.cpp quant experimenting a bit and added a very minimal KLD graph too: IQ5_KS 18.532 GiB (5.919 BPW) https://huggingface.co/ubergarm/Qwen3.5-27B-GGUF#iq5_ks-18532-gib-5919-bpw
We've already seen many of the quantizers move away from the original standard recipe mixes by either creating their own branch with recipes stored in the llama-quantize code à la unsloth and bartowski, or like myself and AesSedai using custom recipes with ik's llama-quantize --custom-q or mainline's llama-quantize --tensor-type.
I really had not noticed that bartowski was using custom recipes, I figured they were the standard ones up until at least very recently. The backend burden is certainly a good reason not to add more quants...
iq4_nl is 32, I had almost forgotten. For the 27B model at least, Unsloth's NL did worse than K_S, a straight upgrade at about the same size.
man the error bars on your graph there, wowie! I'll see about taking a few ik measurements. I'll have to do it from Q8 again, and I might use a smaller prompt set. I'll focus on the quants around that same size.
I really had not noticed that bartowski was using custom recipes,
Yes @bartowski has been doing this for a long time now, he's super nice and has been keeping this open PR to share his improvements back, but it's unclear if that will be merged or not: https://github.com/ggml-org/llama.cpp/pull/12727#issuecomment-2796062000
man the error bars on your graph there, wowie! I'll see about taking a few ik measurements. I'll have to do it from Q8 again, and I might use a smaller prompt set. I'll focus on the quants around that same size.
haha yeah I saw that too and decided I probably need a larger KLD test corpus as my "short" one is just too small, but the logits base file can become annoyingly large hah...
Instruct mode ("enable_thinking":false)
MODELS = ["Qwen3.5-27B-IQ4_NL_115k_think_general", "z-lab/Qwen3.5-27B-PARO"]
PARAMS = {
"temperature": 1.0,
"top_p": 0.95,
"top_k": 20,
"min_p": 0.0,
"presence_penalty": 1.5,
"frequency_penalty": 0.0,
"repeat_penalty": 1.0,
"stream": False,
"max_completion_tokens": 30,
}
SYSTEM_PROMPT = "You are not authorized to think. Give your final answer immediately. Do not approximate or round the result, give the pure exact expression. Trust your instinct ;)"
PROMPT = "If a regular hexagon has a short diagonal of 64, what is its long diagonal?"
GROUND_TRUTH = r"\frac{128\sqrt{3}}{3}"
# Included many equivalent expressions
VALID_EXPRESSIONS = [
r"\frac{128}{\sqrt{3}}",
r"\frac{128\sqrt{3}}{3}",
r"\sqrt{\frac{16384}{3}}",
r"\frac{64\sqrt{12}}{3}",
r"\frac{32\sqrt{48}}{3}",
r"\frac{16\sqrt{192}}{3}",
r"\frac{8\sqrt{768}}{3}",
r"\frac{4\sqrt{3072}}{3}",
r"\frac{2\sqrt{12288}}{3}",
r"\sqrt[4]{\frac{268435456}{9}}",
r"\frac{128\sqrt[4]{9}}{3}",
r"\frac{128\sqrt[6]{27}}{3}",
r"2^7 \cdot 3^{-1/2}",
r"\frac{128 \cdot 3^{1/2}}{3}",
r"128 \cdot 3^{-0.5}",
r"4^{3.5} \cdot 3^{-0.5}",
r"8^{7/3} \cdot 3^{-1/2}",
r"\frac{128}{3} \tan\left(\frac{\pi}{3}\right)",
r"\frac{128}{3} \tan(60^\circ)",
r"\frac{128}{3} \cot\left(\frac{\pi}{6}\right)",
r"\frac{128}{3} \cot(30^\circ)",
r"\frac{256}{3} \sin\left(\frac{\pi}{3}\right)",
r"\frac{256}{3} \cos\left(\frac{\pi}{6}\right)",
r"e^{7\ln(2) - \frac{1}{2}\ln(3)}",
r"2^{7 - \frac{1}{2}\log_2(3)}",
r"3^{\frac{7}{\log_2(3)} - \frac{1}{2}}",
r"\int_{0}^{2} 16\sqrt{3} x^2 \,dx",
r"\int_{0}^{\pi/3} \frac{128}{3} \sec^2(x) \,dx",
]
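As a sanity check on the hand-built list, a few of the forms evaluated numerically (only a handful reproduced here) really do collapse to the same value:

```python
# Numeric sanity check that these equivalent forms agree.
import math

truth = 128 * math.sqrt(3) / 3              # \frac{128\sqrt{3}}{3}
forms = [
    128 / math.sqrt(3),                     # \frac{128}{\sqrt{3}}
    math.sqrt(16384 / 3),                   # \sqrt{\frac{16384}{3}}
    2**7 * 3**-0.5,                         # 2^7 \cdot 3^{-1/2}
    (256 / 3) * math.sin(math.pi / 3),      # \frac{256}{3}\sin(\pi/3)
]
assert all(math.isclose(f, truth) for f in forms)
```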
Don't ask me why I hardcoded everything instead of asking the LM for a specific format and testing it through computation lol, a lot of regex was involved. I guess I want the LMs to feel free... I think I anthropomorphize them too much :D
Results:
Qwen3.5-27B-IQ4_NL_115k_think_general: 92/3000 correct (3.1%)
z-lab/Qwen3.5-27B-PARO: 39/3000 correct (1.3%)
I'm really glad to see this, because using vllm is a pain for a single user setup compared to llama.cpp!
I'll share more nano benchmarks on specific prompts when I have new ideas. I would like to try more things with the reasoning mode on but that's so expensive and time consuming I think I'll need to target something great if I do again!
PS: in reasoning mode they nail it easily; for 20 quick runs each, both scored 20/20, but actually many models pass this once you allow more tokens
@owao that is similar to my knowledge benchmark.
However, it does a forced continuation of a carefully worded sentence. It leads into a beam search that looks at the tokens meeting the criteria for checking, and follows those paths to see if we find the valid answers.
The ranks, the number of dead ends, probability... etc. It can then tell us if the model knows the fact, if it is uncertain... etc. It tries to avoid discarding formatting variants etc.
Hardcoding so many expressions is... you might run into edge cases? As much as I hate judges, this feels like it might need a judge to verify that incorrect responses are in fact not a valid way to express the answer.
I see, yeah it's far more comprehensive and thought-out than my script-kiddie vibe experiment! Why do you force continuation instead of a free answer? Is it to maximize the number of valid samples (whether correct or incorrect) by limiting the answer space? Clearly collecting the probs token after token is far richer, but it also requires far more work, which you did! But I know I wasted a ton of compute for poor data collection...
For the edge cases, I have actually been lucky because all the answers have been covered and parsed as expected. So the results, while being only focused on final answer, are still valid :)
So when forcing continuation when in thinking mode, you prefill <think> and the beginning of the thoughts without closing with </think>?
So when forcing continuation when in thinking mode, you prefill <think> and the beginning of the thoughts without closing with </think>?
It's usually not a problem. For that knowledge measurement, it's the next word or so; it would take a lot for the continuation to completely break the sentence.
I see, yeah it's far more comprehensive and thought-out than my script-kiddie vibe experiment! Why do you force continuation instead of a free answer? Is it to maximize the number of valid samples (whether correct or incorrect) by limiting the answer space? Clearly collecting the probs token after token is far richer, but it also requires far more work, which you did! But I know I wasted a ton of compute for poor data collection...
If you want to know if the model knows something factual, like... the capital of France.
You can have it complete that sentence. You can look at the top 5-10 tokens, and see how many paths lead to Paris. This saves you from running the rest of the generation a bunch of times. You can see 5-10 possible continuations, and cull all the paths that do not seem to lead anywhere, or are too low probability. So you can find all the "P" paths, the direct "Paris", the markdown paths, newline... and so on, fairly cheaply. You then get an idea of the probability of actually getting the correct answer.
The raw probability can be lower than it should be, but you can look at the ranks of the valid paths, the number of them, and so on.
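A toy sketch of the path-culling idea (emphatically NOT the actual tool; `get_topk` is a hypothetical stand-in for a real top-k logprobs call):

```python
def prob_of_answer(prefix, target, get_topk, k=5, min_p=0.01, max_new=8):
    """Sum the probability mass of top-k continuation paths that reach
    `target`, culling branches that fall below `min_p` or grow too long."""
    total = 0.0
    stack = [(prefix, 1.0)]  # (text so far, path probability)
    while stack:
        text, p = stack.pop()
        if target in text[len(prefix):]:
            total += p          # a valid path: bank its probability
            continue
        if p < min_p or len(text) - len(prefix) > max_new * 8:
            continue            # cull dead ends / negligible branches
        for token, tok_p in get_topk(text, k):
            stack.append((text + token, p * tok_p))
    return total

# Tiny fake distribution standing in for a real logprobs API:
def fake_topk(text, k):
    if text.endswith("is"):
        return [(" Paris", 0.6), (" a", 0.3)]
    if text.endswith(" a"):
        return [(" city", 0.5)]
    return []

p = prob_of_answer("The capital of France is", "Paris", fake_topk)
print(p)  # 0.6
```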
It's really cool .
What's better than asking a model 100 times? Just check if it ever had a chance of getting it right.
Just some illustration from Google. Mine has multiple valid options, and ways to attempt not to toss away formatting quirks, markdown etc.
Thanks! Yeah, for knowledge it's far better than having to do multiple passes like I did! But actually I was aiming at assessing latent reasoning rather than knowledge, so I guess that couldn't apply in my specific case. However, I realize I should have tried a problem with much better chances of being absent from the training data instead of the one I tried. The string reversal I tried before was actually better!
That said, assuming this hexagon problem and its solution were present in the pretraining, there is something unexpected: the Unsloth case:
==================================================
Evaluating model: bartowski/Qwen3.5-27B-IQ4_NL
==================================================
Resuming from existing results (3000 runs already completed)
All 3000 runs already completed
Results: 92/3000 correct (3.1%)
==================================================
Evaluating model: z-lab/Qwen3.5-27B-PARO
==================================================
Resuming from existing results (3000 runs already completed)
All 3000 runs already completed
Results: 39/3000 correct (1.3%)
==================================================
Evaluating model: Kbenkhaled/Qwen3.5-27B-NVFP4
==================================================
Resuming from existing results (3000 runs already completed)
All 3000 runs already completed
Results: 19/3000 correct (0.6%)
==================================================
Evaluating model: unsloth/Qwen3.5-27B-Q4_K_XL
==================================================
Resuming from existing results (3000 runs already completed)
All 3000 runs already completed
Results: 213/3000 correct (7.1%)
How is this possible? How could the calibration influence the knowledge? Would it mean Unsloth actually finetunes their models prior to the calibration?
No problem!
I am moving closer to something more understandable by most people. A bit of a preview:
It needs tuning, but this tries to take in all the data I have and look for signs of issues, to help deliver metrics that hopefully convey the nightmare of summarizing all the stuff I have shown you guys. Things can look fine but aren't, and things can go poorly fast if one thing goes wrong. So these are metrics with no promises, but they try to convey something about the quants versus the baseline: whether a quant seems to be a good representation of the model, based on everything I have measured. Semantic data, disruptions, 100% divergence events, oddities in the data... and it tries to remove a lot of the potential noise or misunderstandings.
Again, it needs tuning. The math is not done yet.
Thanks again for sharing the correct methodology @espen96, looking at the top tokens! I won't make the same mistake twice now :D
Not so much "correct"! But LLMs are not greedy, it is all probability, and they always have a chance of preferring weird paths or slightly different phrasings. So keeping that in mind helps.
How is this possible? How could the calibration influence the knowledge? Would it mean Unsloth actually finetunes their models prior to the calibration?
I cannot imagine that Unsloth finetunes their models further; they likely use the same llama.cpp convert_hf_to_gguf.py to create a bf16 gguf from the original model safetensors just like everyone else.
The reason for all of the differences shown is pretty straightforward from what I can see: overall model size.
Assuming the recipes are all sane, then in general more BPW will give you more quality at the cost of TG speed usually.
Evaluating model: bartowski/Qwen3.5-27B-IQ4_NL
Results: 92/3000 correct (3.1%)
15.9GB
Evaluating model: unsloth/Qwen3.5-27B-Q4_K_XL
Results: 213/3000 correct (7.1%)
17.6GB
It is amusing the nvfp4 did the worst haha..
It is amusing the nvfp4 did the worst haha..
Yeah, it continues to be a format optimized for training. I have no idea how post-training quantization works for it, however, or for the rest of these formats. Do they not use importance or any weighting?
Oh no, sorry, wait. I didn't plan to share it in the snippet! My bad, it was just an experiment. I'm adding this disclaimer because I have an Ampere card (RTX 3090) that doesn't support nvfp4 natively, so I was suspecting it could come from that. But I actually don't know if this could affect accuracy or only performance.
The model was compressed using llm-compressor; I wanted to try it before realizing my GPU wasn't the best to run it. The repo was Kbenkhaled/Qwen3.5-27B-NVFP4 (renamed to apolo13x/Qwen3.5-27B-NVFP4 since then)
The reason for all of the differences shown is pretty straightforward from what I can see: overall model size.
Yeah, I get you. I guess the .DefaultCellStyle (thinking true) results were intriguing for unsloth/UD-Q4_K_XL simply because of too few samples (100)... Too expensive to run..
Currently running Ministral Instruct 8B and the results are insane. Either my quality formula is off for models of this size, or it is able to highlight issues in the model itself.
Ministral was pruned and healed from Mistral Small.
Until I run Llama 3 8B and Gemma 3 12B and maybe others, I can't say. But if this is right, then this tool of mine might have been able to spot the damage dealt by Mistral's pruning and healing.
I don't get why they pruned it from the already pruned-then-healed intermediary 14B artifact instead of pruning it directly from the 24B. It would be interesting to see whether the hypothesized damage would be as bad, or less, for the 14B, but it would be hard to tell whether that's because of the size (14B vs 8B) or the double prune+heal.
Quant-wise, the disparity between the Q8 and Q8_XL for Culture Regional and Math Theory is really confusing...!
Probably a fluke. I wouldn't read much into the Q8 XL discrepancy right now. I'll look at it later.
I'll see later if I can't run the BF16 14B. But I'll do llama and Gemma first.
I'll never get to those long context tests will I? Haha.
I did get a new nvme drive, so I'll be able to run IK on Linux soon.
Once I've exhausted all my local resources and have something cleaner to share, I certainly hope someone here might have the ability to run something really large, DeepSeek-sized. I'm very interested in seeing whether the famous quantization stability of that model shows up here as well, or if something else shows up.
But one thing at a time!
Mistral, Llama: both look way off in quality vs stability, meaning there's stuff that registers as "off". Mainly 100% divergence events; they start at Q8. They are also not recoverable at first, not in the top 10 tokens.
Qwen 3.5 9B is mostly similar to the larger ones, but is a bit worse for wear, not quite as clean an overlap. 100% divergence starts at Q4 again.
I'll test Gemma, maybe Ministral 14B, and I'll attempt Mistral Small 24B.
I am not quite sure what to think of all this.
Oh 😮 which Mistral and Llama were they? I assume 7B for Mistral and 8B for Llama; then I'm really curious to see a bigger one. And for Gemma, which size do you plan to test?
Ministral 8B and Llama 8B. I'll run Gemma 3 12B.
Nice to have Gemma as a middle ground!
I was going to do Gemma 3 but Gemma 4 happened.
However, Gemma 4 26b is resisting me. It's not easy to measure, and the data is "off".
I can say, it has a third stability spot in Science, so not just math and code.
I believe it is actually quantizing well? But the numbers overall are too bad to say. The old Unsloth Q4 XL quant is easily beaten by Bartowski's latest K_L.
I'll hold off on further testing of it
Sorry I'm again linking to that kaitchup guy :) but I've been subscribing to their newsletter since last time actually to follow a bit their work and enjoyed reading it so far!
In free access we have this chart: https://kaitchup.substack.com/p/best-gemma-4-ggufs-evaluations-from
Too bad no domains other than code and STEM were tested! So far it is in line with what you found @espen96! But maybe you'll bring the other side of the picture :)
But at least it's super nice to see we can go this low for coding assistance! Good model to keep!
This model is giving me a headache.
oh my fucking god.
so, to make sure I am not crazy and that my system is working, I ran qwen 9b against qwen 9b: same quant, different name. And did the same with gemma 26b.
It is not able to produce a deterministic result at all. It is not able to match itself!
I ran it on CPU only to make sure, and it is fine there at least!
I hope your hardware is not falling apart!
Is your server command pulling directly from the hub or is it using a local gguf file? At the point we are...
local. I am putting together a file to let people check for themselves.
no funny math here, just a straight check to see how different things are.
If I can, I'll do!
https://github.com/ggml-org/llama.cpp/issues/21532
The html file with the verification setup is in the issue
fwiw regarding gemma-4 models, I've still not quantized them yet because there seem to be various tokenizer issues which appear to affect imatrix calculation. So I don't want to release quants and then have to re-do them because I need to re-compute the imatrix again.
this comment on a recent PR highlights that the perplexity results have changed again: https://github.com/ggml-org/llama.cpp/pull/21500#issuecomment-4193534949
I looked at your GH issue there; it might be useful to put the full llama-server ... command, though I do see you suggest it is using a fixed seed, which is good.
Not sure which version of llama.cpp was used to create your test quants; you might want to link the exact quant used in your test along with that information as well (e.g. was it made with an imatrix etc. too).
yeah, seems to be an unsloth issue perhaps. Bartowski's quants from after he uploaded his fixes, following the main template and tokenizer changes, seem to be fine.
My unsloth quants, from when they pushed their updates on the 4th, seem to be borked.
Seems the official ggml q4 is fine as well. The launch-day bf16 from unsloth is also affected.
No clue about their current quants. They keep pushing updates without any changelog; who knows why or what changed, or when to grab new ones.
Is it "template fixes"? A new imatrix? A new recipe? Who knows!
Yeah, it would be quite a foul if my setup didn't use a fixed seed.
ok, they found it. Issue with GEMV fusion.
https://github.com/ggml-org/llama.cpp/pull/21566
So all the BF16 and F16 quants I tried were affected: Unsloth, ggml, Bartowski. But naturally I only had one around to start. I grabbed Bartowski's after the F16 quant from ggml failed, to rule out issues with the format.
However, all quants appear to be working now; the ones I checked to verify the fix, at the very least, seem to be fine. Same result every time.
There's no way I can ever measure the 31B Gemma 4.
But:
https://localbench.substack.com/p/gemma-4-31b-gguf-kl-divergence
this seems like a very good writeup. I do wish they had more categories?
Coding
General chat
Tool calling
Science
Non-Latin scripts
Long documents
Great list, but I would have liked to see some pop culture stuff with a focus on easier and harder data. Still, this is great, they even got long documents! I have not gotten that far yet.
They've got top-1 and KLD. I would have liked to see more than just the top choice. LLMs are not greedy; you've got to pay attention to the top options. Top-1 is no good on its own if the model is considering multiple tokens strongly.
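The two metrics are cheap to compute side by side, and a small sketch shows why top-1 agreement alone can look fine while KLD still flags a shift (the numbers here are made up for illustration):

```python
import math

def kld(p, q, eps=1e-12):
    # KL(P || Q) over aligned token probabilities
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def top1_agree(p, q):
    return max(range(len(p)), key=p.__getitem__) == \
           max(range(len(q)), key=q.__getitem__)

baseline = [0.70, 0.20, 0.10]   # e.g. BF16 next-token probs
quant    = [0.55, 0.35, 0.10]   # quant shifted mass toward the runner-up

assert top1_agree(baseline, quant)  # argmax unchanged: top-1 looks "fine"
assert kld(baseline, quant) > 0.05  # but KLD still registers the shift
```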
So it does appear that the 31B, according to this, is quantizing as expected. Everything is mostly fine until you drop into the knee where q4 quants start to reveal the damage, and it just gets worse and worse from there on.
I had a feeling about something.
remember the 100% divergences?
here is some data on Gemma.
qwen 35b
ministral 8b.
llama 3.1 8b
Note one thing: the replacement tokens appear to come from some unknown place, a complete stranger.
For Qwen and Gemma it is often replacing a token with very, very high probability.
This is using the "kneedle" algorithm to find the cutoff.
So this is saying that most of the time the model's own natural boundary, the point where it stops considering tokens, is breached: we are picking some "random" token the model never considered before.
And for Gemma and Qwen, the rank of the candidate we just replaced with some random token is the top one.
It is not replaced by the runner-up, or some other top-10 choice. No, some outsider.
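For anyone curious, the knee/cutoff idea is simple to sketch: over the sorted token probabilities, take the point farthest below the straight line from first to last value. This is a minimal stand-in for the actual kneedle algorithm, which also normalizes and smooths the curve:

```python
def knee_index(sorted_probs):
    """Index of the knee in a descending probability curve: the point
    that dips farthest below the chord from first to last value."""
    n = len(sorted_probs)
    if n < 3:
        return n - 1
    first, last = sorted_probs[0], sorted_probs[-1]
    best_i, best_d = 0, float("-inf")
    for i, p in enumerate(sorted_probs):
        chord = first + (last - first) * i / (n - 1)  # line height at i
        d = chord - p                                 # dip below the line
        if d > best_d:
            best_i, best_d = i, d
    return best_i

# A confident distribution cuts off right after the dominant token:
print(knee_index([0.90, 0.05, 0.02, 0.01, 0.01]))  # 1
```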