# Qwen3.5-0.8B
Selected quantizations of Qwen3.5-0.8B in GGUF format. They are based on the unquantized version of this model from Unsloth (using the Unsloth imatrix and incorporating the Unsloth chat template fixes), but quantize specific tensors at higher precision to reduce quantization-induced deviations.
The changes affect the following tensors, which were explicitly quantized with Q8_0:
- `*.attn_gate.weight`
- `*.attn_qkv.weight`
- `*.ssm_out.weight`
Due to the model architecture of Qwen3.5, low precision in these specific tensors apparently has a disproportionately strong negative impact on typical quality metrics for quantized models (perplexity, KLD). Therefore, it can be assumed that increasing the precision for these tensors should lead to higher accuracy and fewer errors.
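For orientation, the KLD figures reported below are based on the Kullback-Leibler divergence between the per-token output distribution $P$ of the unquantized reference model and the distribution $Q$ of the quantized model (standard definition, given here for reference):

$$
D_{\mathrm{KL}}(P \parallel Q) = \sum_{i} P(i) \, \log \frac{P(i)}{Q(i)}
$$

Lower values mean the quantized model's token probabilities track the unquantized reference more closely.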
This idea is borrowed from the Qwen3.5-35B-A3B quantizations by AesSedai. See also the excellent documentation from Unsloth regarding the effects of quantization on different tensor types for Qwen3.5 at https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-some-tensors-are-very-sensitive-to-quantization.
Even though Unsloth's measurements suggest that q6_k is the sweet spot between memory consumption and KLD deviation for the sensitive tensors, I nevertheless chose q8_0: my CPU-only benchmarks with the current version of llama.cpp show better performance, particularly during prompt processing, with selective q8_0 tensors than with q6_k. This is merely a snapshot for the current llama.cpp version and the specific case of CPU-only inference.
## KLD metrics
The following metrics were generated with llama-perplexity on the wiki.test.raw dataset. The unquantized version of Qwen3.5-0.8B from Unsloth serves as the reference.
### Q4_K_M
| KLD metric | Unsloth | this |
|---|---|---|
| Mean KLD | 0.036272 ± 0.000154 | 0.029751 ± 0.000104 |
| Maximum KLD | 6.372144 | 2.502058 |
| 99.9% KLD | 0.576699 | 0.404997 |
| 99.0% KLD | 0.198280 | 0.154058 |
| 95.0% KLD | 0.100348 | 0.081550 |
| 90.0% KLD | 0.073783 | 0.060891 |
| Median KLD | 0.025968 | 0.021944 |
| 10.0% KLD | 0.002530 | 0.002113 |
| 5.0% KLD | 0.000720 | 0.000598 |
| 1.0% KLD | 0.000076 | 0.000064 |
| 0.1% KLD | 0.000009 | 0.000007 |
### Q5_K_M
| KLD metric | Unsloth | this |
|---|---|---|
| Mean KLD | 0.013468 ± 0.000055 | 0.011024 ± 0.000037 |
| Maximum KLD | 2.288366 | 1.584060 |
| 99.9% KLD | 0.212526 | 0.144384 |
| 99.0% KLD | 0.070535 | 0.055442 |
| 95.0% KLD | 0.036214 | 0.029511 |
| 90.0% KLD | 0.027000 | 0.022165 |
| Median KLD | 0.009949 | 0.008428 |
| 10.0% KLD | 0.000949 | 0.000799 |
| 5.0% KLD | 0.000251 | 0.000211 |
| 1.0% KLD | 0.000028 | 0.000023 |
| 0.1% KLD | 0.000003 | 0.000002 |
### Q6_K
| KLD metric | Unsloth | this |
|---|---|---|
| Mean KLD | 0.006145 ± 0.000022 | 0.005707 ± 0.000020 |
| Maximum KLD | 1.056829 | 1.131829 |
| 99.9% KLD | 0.080446 | 0.070042 |
| 99.0% KLD | 0.030629 | 0.028008 |
| 95.0% KLD | 0.016255 | 0.014990 |
| 90.0% KLD | 0.012254 | 0.011357 |
| Median KLD | 0.004721 | 0.004436 |
| 10.0% KLD | 0.000461 | 0.000437 |
| 5.0% KLD | 0.000121 | 0.000114 |
| 1.0% KLD | 0.000013 | 0.000011 |
| 0.1% KLD | 0.000000 | 0.000000 |
## Quantization steps
The following outlines the steps I took to create these quantizations. They serve as a reference for my future use, but are also intended to be a helpful example or guide for others.
### Clone huggingface unsloth/Qwen3.5-0.8B
The unquantized model from Unsloth serves as the starting point for the quantization. We explicitly do not use the unquantized model from Qwen at this stage, because Unsloth frequently ships improvements and bug fixes, such as updates to the chat template or the base model configuration, that improve both working with the model and its output quality.
```shell
mkdir -p ~/huggingface/unsloth
cd ~/huggingface/unsloth
git clone git@hf.co:unsloth/Qwen3.5-0.8B
```
After this step, the unquantized model is located in the directory ~/huggingface/unsloth/Qwen3.5-0.8B.
### Clone huggingface unsloth/Qwen3.5-0.8B-GGUF (without large files)
Next, we clone the repository for the pre-quantized versions of the model from Unsloth. We are not interested in the actual quantizations themselves, but rather only in the imatrix calibration file. We will later use the same imatrix as Unsloth to quantize the model.
First, the repository is cloned without downloading the large model files or similar data. At this stage, only pointers for the large files are created in the directory.
```shell
cd ~/huggingface/unsloth
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/unsloth/Qwen3.5-0.8B-GGUF
```
Selectively pull the (large) imatrix calibration file `imatrix_unsloth.gguf_file`, which will be used by the quantization later on.
```shell
cd ~/huggingface/unsloth/Qwen3.5-0.8B-GGUF
git lfs pull --include="imatrix_unsloth.gguf_file"
```
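With `GIT_LFS_SKIP_SMUDGE=1`, files that have not been pulled remain small text pointers rather than real payloads. A quick way to confirm that the pull actually replaced the pointer is to inspect the first bytes of the file. The sketch below demonstrates the check on a throwaway stand-in file, not the actual download; apply the same `head` check to `imatrix_unsloth.gguf_file`.

```shell
# Git LFS pointer files are tiny text files whose first line starts with
# "version https://git-lfs.github.com/spec/v1"; a real imatrix is a large
# binary file. Demonstrated here on a temporary stand-in file.
f=$(mktemp)
printf 'version https://git-lfs.github.com/spec/v1\n' > "$f"
if head -c 7 "$f" | grep -q '^version'; then
  echo "still an LFS pointer - run git lfs pull"
else
  echo "payload present"
fi
rm -f "$f"
```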
### Download and compile llama.cpp
It is likely a good idea to always work with the latest version of llama.cpp when quantizing models, especially when dealing with new or very recent models. For this reason, llama.cpp is cloned and built from source in this step.
First, the necessary packages must be installed to be able to build software from source. For Ubuntu Linux, for example, it would look like this.
```shell
sudo apt-get install -y --no-install-recommends \
  build-essential \
  cmake \
  git \
  python3 \
  python3-venv \
  python3-pip \
  ca-certificates \
  libxcb-cursor0
```
Next, the official llama.cpp repository will be cloned and built.
```shell
mkdir -p ~/github/ggml-org
cd ~/github/ggml-org
git clone https://github.com/ggml-org/llama.cpp.git
cmake -S ./llama.cpp -B ./llama.cpp/build
cmake --build ./llama.cpp/build --config Release
```
### Install necessary requirements in Python virtual environment
To convert the unquantized model from its original format (safetensors) to GGUF, we need a script from the llama.cpp repository. This is a Python script that requires the installation of several dependencies before use.
```shell
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip wheel
python -m pip install -r ~/github/ggml-org/llama.cpp/requirements.txt
python -m pip install --upgrade transformers # or else convert_hf_to_gguf.py throws warnings about unrecognized model
python -m pip install PySide6 # only necessary for graphical llama.cpp/gguf-py/gguf/scripts/gguf_editor_gui.py
deactivate
```
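If the activation worked, `python` now resolves inside the virtual environment. A self-contained way to verify that behaviour (sketched with a throwaway venv so the snippet does not touch the real one):

```shell
# Create a temporary venv, activate it, and check that "python" now
# resolves to the venv's interpreter. Clean up afterwards.
venv_dir=$(mktemp -d)
python3 -m venv "$venv_dir"
. "$venv_dir/bin/activate"
case "$(command -v python)" in
  "$venv_dir"/bin/python) echo "venv active" ;;
  *) echo "venv not active" ;;
esac
deactivate
rm -rf "$venv_dir"
```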
### Convert original safetensors model to GGUF (without applying quantization)
The original model can now be converted to GGUF. No quantization is performed at this stage; this will take place in the next step.
```shell
source .venv/bin/activate
python ~/github/ggml-org/llama.cpp/convert_hf_to_gguf.py \
  ~/huggingface/unsloth/Qwen3.5-0.8B \
  --outfile Qwen3.5-0.8B-unquantized.gguf
deactivate
```
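The converted file can be sanity-checked via its magic number: every GGUF file begins with the four ASCII bytes `GGUF`. The check below is illustrated on a temporary stand-in file; point it at `Qwen3.5-0.8B-unquantized.gguf` after the conversion.

```shell
# Every GGUF file starts with the 4-byte magic "GGUF".
# Stand-in file for illustration; use the real .gguf path in practice.
f=$(mktemp)
printf 'GGUF' > "$f"
if [ "$(head -c 4 "$f")" = "GGUF" ]; then
  echo "magic ok"
else
  echo "not a GGUF file"
fi
rm -f "$f"
```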
### Quantize model with specific tensor weights using Unsloth imatrix
Now the actual quantization takes place. For this, we use the imatrix calibration from Unsloth for Qwen3.5-0.8B. Additionally, we specify in the `tensor_types.txt` file that certain tensors should be quantized at a higher level of precision than the standard.
The content of the tensor_types.txt file looks roughly like this:
```
.*\.attn_gate\.weight$=q8_0
.*\.attn_qkv\.weight$=q8_0
.*\.ssm_out\.weight$=q8_0
```
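The patterns are anchored regular expressions matched against full tensor names. A quick way to preview which tensors an override would hit is to run the same pattern through `grep -E`, which uses the same extended-regex syntax; the tensor names below are hypothetical examples for illustration.

```shell
# Feed candidate tensor names through the override pattern to see which match.
# The names are hypothetical; list real ones e.g. via a GGUF inspection tool.
printf '%s\n' \
  'blk.0.attn_gate.weight' \
  'blk.0.attn_qkv.weight' \
  'blk.0.ssm_out.weight' \
  'blk.0.ffn_down.weight' \
| grep -E '.*\.(attn_gate|attn_qkv|ssm_out)\.weight$'
```

Only the first three names match; `blk.0.ffn_down.weight` is filtered out and would keep the default quantization level.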
Assuming the steps have been carried out as described in the instructions, the unquantized GGUF version of the model is currently located in the working directory, as is the tensor_types.txt file. The imatrix calibration file for the model is located in the ~/huggingface/unsloth/Qwen3.5-0.8B-GGUF directory and is named imatrix_unsloth.gguf_file.
Therefore, the command to quantize the model to Q4_K_M is as follows:
```shell
~/github/ggml-org/llama.cpp/build/bin/llama-quantize \
  --tensor-type-file tensor_types.txt \
  --imatrix ~/huggingface/unsloth/Qwen3.5-0.8B-GGUF/imatrix_unsloth.gguf_file \
  Qwen3.5-0.8B-unquantized.gguf \
  Qwen3.5-0.8B-Q4_K_M.gguf \
  Q4_K_M
```
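Since several quantization levels are produced, the invocation can be wrapped in a loop. The sketch below only prints the commands for review (remove the leading `echo` to actually run them); paths are as set up in the previous steps.

```shell
# Print the llama-quantize invocation for each target level.
# Remove the leading "echo" to execute the commands for real.
for q in Q4_K_M Q5_K_M Q6_K; do
  echo ~/github/ggml-org/llama.cpp/build/bin/llama-quantize \
    --tensor-type-file tensor_types.txt \
    --imatrix ~/huggingface/unsloth/Qwen3.5-0.8B-GGUF/imatrix_unsloth.gguf_file \
    Qwen3.5-0.8B-unquantized.gguf \
    "Qwen3.5-0.8B-${q}.gguf" \
    "$q"
done
```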
### Measure perplexity for unquantized model and record base data for KLD calculation
To calculate metrics such as KLD, we require a reference value generated based on the unquantized model and a test dataset. Typically, wiki.test.raw is used as the test dataset for this purpose. The baseline values, which will later be used to determine the respective deviations for the quantizations, are stored in Qwen3.5-0.8B-unquantized.kld. Please note that this file can become extremely large.
```shell
~/github/ggml-org/llama.cpp/build/bin/llama-perplexity -t 8 \
  -m Qwen3.5-0.8B-unquantized.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base Qwen3.5-0.8B-unquantized.kld
```
Now, values such as perplexity or KLD can be derived in comparison to the base model. Here is an example of generating these metrics for the Q4_K_M quantization.
```shell
~/github/ggml-org/llama.cpp/build/bin/llama-perplexity -t 8 \
  -m Qwen3.5-0.8B-Q4_K_M.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base Qwen3.5-0.8B-unquantized.kld \
  --kl-divergence
```