---
base_model:
- zai-org/GLM-4.7
---
This repo contains specialized MoE quants for GLM-4.7. The idea is that, since the conditional-expert FFN tensors are huge compared to the rest of the tensors in the model, quantizing them more aggressively while keeping everything else at high precision should yield better quality at a smaller overall size than a comparable naive quantization. To that end, the default quantization type is kept high quality (Q8_0 down to Q5_K), while the FFN_UP and FFN_GATE expert tensors are quantized down the furthest and the FFN_DOWN expert tensors are kept one step above them.
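Recent llama.cpp builds let you override the quantization type of individual tensors at quantize time, so a mix like this can in principle be produced in one pass. A sketch only — the file names are placeholders and the `--tensor-type` pattern syntax varies between llama.cpp versions, so check `llama-quantize --help` for your build:

```shell
# Hypothetical invocation; paths are placeholders and the --tensor-type
# pattern syntax may differ across llama.cpp versions.
# Guarded so this is a no-op when llama-quantize is not installed.
if command -v llama-quantize >/dev/null 2>&1; then
  llama-quantize \
    --tensor-type 'ffn_up_exps=q4_k' \
    --tensor-type 'ffn_gate_exps=q4_k' \
    --tensor-type 'ffn_down_exps=q5_k' \
    GLM-4.7-F16.gguf GLM-4.7-Q8_0-Q4_K-Q4_K-Q5_K.gguf Q8_0
fi
```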

The mixture convention is as follows: `[Default Type]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN]`, e.g. `Q8_0-Q4_K-Q4_K-Q5_K`. This means:
- Q8_0 is the default type (attention, shared expert, etc.)
- Q4_K was used for the FFN_UP and FFN_GATE conditional expert tensors
- Q5_K was used for the FFN_DOWN conditional expert tensors
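Since quant type names use underscores, the `-` separator splits a label unambiguously into its four slots. A small sketch of the convention (the function name is my own, not part of this repo):

```python
def parse_mixture(label: str) -> dict:
    """Split a mixture label like 'Q8_0-Q4_K-Q4_K-Q5_K' into its four slots.

    Quant type names use underscores, so '-' only ever separates slots.
    """
    default, ffn_up, ffn_gate, ffn_down = label.split("-")
    return {
        "default": default,    # attention, shared expert, etc.
        "ffn_up": ffn_up,      # conditional expert FFN_UP
        "ffn_gate": ffn_gate,  # conditional expert FFN_GATE
        "ffn_down": ffn_down,  # conditional expert FFN_DOWN
    }

print(parse_mixture("Q8_0-Q4_K-Q4_K-Q5_K"))
```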

I've named each mix after the standard quant whose BPW it most closely matches, as best I could discern.

| Quant | Size | Mixture | PPL | KLD |
| :--- | :--- | :--- | :--- | :--- |
| Q8_0 | 354.79 GiB (8.50 BPW) | Q8_0 | 8.6821 ± 0.15706 | 0 |
| Q5_K_M | 250.15 GiB (6.00 BPW) | Q8_0-Q5_K-Q5_K-Q6_K | 8.682378 ± 0.157101 | 0.011578 ± 0.000687 |
| Q4_K_M | 209.77 GiB (5.03 BPW) | Q8_0-Q4_K-Q4_K-Q5_K | 8.746787 ± 0.158456 | 0.017262 ± 0.000585 |
| IQ4_XS | 165.28 GiB (3.96 BPW) | Q8_0-IQ3_S-IQ3_S-IQ4_XS | 8.866443 ± 0.160719 | 0.043752 ± 0.001071 |
| IQ2_M | 107.12 GiB (2.57 BPW) | Q5_K-IQ2_XXS-IQ2_XXS-IQ3_XXS | 9.824880 ± 0.179312 | 0.194644 ± 0.003154 |
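As a sanity check on the BPW column, file size and BPW are related through the total weight count. A quick sketch — the parameter count here is back-derived from the Q8_0 row rather than taken from any official spec:

```python
def bpw(size_gib: float, n_params: float) -> float:
    """Bits per weight implied by a file size and a parameter count."""
    return size_gib * 2**30 * 8 / n_params

# Back-derive the weight count from the Q8_0 row (354.79 GiB at 8.50 BPW);
# this is an assumption for illustration, not an official figure.
N = 354.79 * 2**30 * 8 / 8.50

print(round(bpw(209.77, N), 2))  # Q4_K_M row -> 5.03
```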