I see the change in KLD values is pretty modest vs the prior version. Does anyone know how that translates to the real world? Is it more of a linear situation, or exponential, etc.?
Also, the benchmarks are there because they messed up the first version of their Qwen3.5 XL quants by quanting some tensors to mxfp4 that should have been in higher quality, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.
No, your conclusion is false - only the old Q4_K_XL had slightly higher perplexity; all other quants are fine. We uploaded 9TB of research artifacts to https://huggingface.co/unsloth/Qwen3.5-35B-A3B-Experiments-GGUF for the community.
If you read our blog, it says KLD and PPL can actually be counterintuitive - on MiniMax, for example, some of our quants do worse on PPL and KLD than AesSedai's, yet his does worse on LiveCodeBench by a lot - see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-3-perplexity-and-kld-can-be-misleading
This is because some tensors are very sensitive to quantization - see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-some-tensors-are-very-sensitive-to-quantization. Although bitwidths are in general monotonic, i.e. q2_k < q3_k < q4_k < q5_k etc., we find KLD and PPL are actually not monotonic, i.e. q3_k can actually have BETTER PPL than q4_k.
So the main point is bad luck in quantization - sometimes lower bits can get lower (better) PPL and KLD, but this is a ruse: on actual real-world tasks it's worse.
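The PPL-vs-KLD split is easy to see with toy numbers. A minimal sketch (all distributions below are invented for illustration, not from any real model): PPL only looks at the probability assigned to the actual next token, while KLD compares the whole output distribution against the full-precision reference, so a quant can win on one metric and lose on the other.

```python
import math

# Hypothetical next-token distributions over a 4-token vocab.
# Index 0 is the "true" next token in the eval text.
reference = [0.70, 0.10, 0.10, 0.10]    # full-precision model
quant_a   = [0.75, 0.24, 0.005, 0.005]  # boosts the top token, distorts the tail
quant_b   = [0.68, 0.11, 0.105, 0.105]  # slightly flatter, but close everywhere

def ppl(dist, true_idx=0):
    # Single-token perplexity: exp(-log p(true token)).
    return math.exp(-math.log(dist[true_idx]))

def kld(p, q):
    # KL divergence of the quant q from the reference p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(ppl(quant_a), ppl(quant_b))                        # A has LOWER (better) PPL
print(kld(reference, quant_a), kld(reference, quant_b))  # ...but much HIGHER KLD
```

Quant A "wins" on perplexity while mangling the rest of the distribution, which is exactly the kind of result that can look fine on PPL and still break real tasks.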
So then why did the Q4_K_XL have issues? Is it just a PPL issue that doesn't reflect real-world usage? If yes, why not just say that: "The Q4_K_XL had worse PPL, but don't worry, PPL can be wrong, and other benchmarks show it's fine"? If it was a real quality issue, what caused it?
The blog post says "Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE" but doesn't say why. The easy assumption most people would make is "oh, you quanted attention or ssm or something to mxfp4 and that turned out to be bad, so you're retiring mxfp4" - but if you're saying that's not it, then what's the actual issue?
Unsloth Q4_K_M
PPL: 6.6053 KLD 99.9%: 0.5478 KLD mean: 0.0192
bartowski Qwen_Q4_K_M
PPL: 6.6097 KLD 99.9%: 0.5771 KLD mean: 0.0182
Barely noticeable drop in PPL; a noticeable KLD 99.9% drop (good, ~5%); but a worse KLD mean (bad, ~5%). So the KL divergence numbers here are honestly more useful to me than the MMLU tables. I've had MMLU hold steady while the output distribution drifted enough to break things downstream.
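That mean-vs-99.9% split is worth dwelling on, because the two statistics answer different questions. A toy sketch (the per-token KLD samples are invented, not measured) showing how one quant can win on each:

```python
# Per-token KLD samples for two hypothetical quants:
# X is almost always very close to the reference but has rare severe outliers;
# Y drifts a little on every token.
kld_x = [0.01] * 999 + [0.50]
kld_y = [0.02] * 1000

def mean(xs):
    return sum(xs) / len(xs)

def pct_999(xs):
    # 99.9th percentile via integer index arithmetic (no float surprises).
    s = sorted(xs)
    return s[len(s) * 999 // 1000]

print(mean(kld_x), mean(kld_y))        # X wins on mean KLD...
print(pct_999(kld_x), pct_999(kld_y))  # ...Y wins on the 99.9% tail
```

If downstream breakage comes from rare catastrophic tokens, the tail statistic is the one to watch; if it comes from steady drift, the mean is.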
Does the calibration dataset make much difference at 3B though? There's so little redundancy that I'd expect it to hit a floor pretty fast regardless of how good the calibration data is.
Any HN model recommendations to run on my 24GB M5 and any best practices while running them?
With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX5080 16GB.
Old 2/24 Q4_K_XL commit (pre bugfix files): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commit/7a8e0b23fcaf1a052ad02eb73f1c0627177e8325
Questions for a postmortem that the blog post left unanswered:
- Why the change? Is it just to improve PPL/KLD? Sure, we can assume PPL and KLD are not perfect benchmarks - but if so, why change the quantization at all? Or was the old 2/24 quant actually much worse in the real world? I presume the Q4_K_XL quant using mxfp4 was the issue? If the 2/24 files' worse PPL was an actual issue caused by low-quality tensors, why not just say that?
- What were the main tensors whose quantization changed from 2/24 to 2/27? Did you quantize attention tensors differently? Or perhaps ssm?
- What was it changed from? Was it changed from mxfp4 or q4_k to q8, or something else?
A quick sentence in the blog post - something like "OK, we've confirmed that using mxfp4 (or q3, or whatever) in the attention/ssm/biases/norms/etc. is a bad idea; our old 2/24 models had that and today's models fix it" - would make it clear. As written, it tries to say both "PPL/KLD don't actually reflect real-world quality" and "we changed our quants to improve PPL/KLD" at the same time, which seems contradictory.
There's some experiments of just removing or merging experts post training to shrink models even more https://bknyaz.github.io/blog/2026/moe/
Shrinking them, sure - but I've seen nothing that indicates you can just page weights in and out without cratering your performance, same as you would with a non-MoE model.
I’m eager to try it out, especially if 16GB is viable now.
Any resources for configuring the local setup?
My entire home media stack is a single compose file in a WSL distro so it would be cool if local LLM worked the same way.
You can just load the Q4_K_XL model like normal, and put all tensors on GPU without any -ot or --cpu-moe flags.
If you need a massive context for some reason where model + KV cache won't fit in 32GB, then use -ot to move the FFN MoE experts for 1-2 layers into RAM. You'll take a speed hit (loading params from slower RAM instead of fast VRAM), but it'll work.
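For reference, a rough sketch of what that looks like with llama.cpp's llama-server (the model filename and context size are placeholders; the -ot/--override-tensor regex targets the MoE expert tensors of layers 0-1 by their blk.N.ffn_*_exps names, the usual naming in these GGUFs - adjust to your files):

```shell
# Everything on GPU (the simple case) - no -ot / --cpu-moe needed:
llama-server -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -ngl 99 -c 32768

# Huge context: keep the MoE expert tensors of layers 0 and 1 in system RAM,
# everything else stays in VRAM. Expect a speed hit on those layers.
llama-server -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -ngl 99 -c 200000 \
  -ot 'blk\.(0|1)\.ffn_.*_exps\.=CPU'
```

Widen the layer range in the regex (e.g. `blk\.([0-3])\.`) if two layers' worth of experts isn't enough to make the KV cache fit.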
But mmmmmm, Q8_K_XL looks mighty nice.
New model archs usually involve code changes.