Commit 9bbceea (verified, parent 9f16c00) by AesSedai: Create README.md

---
base_model:
- zai-org/GLM-4.7
---
This repo contains specialized MoE quants for GLM-4.7. The idea: since the FFN expert tensors account for the vast majority of the model's parameters, quantizing them more aggressively while keeping everything else at high precision should yield better quality at a smaller overall size than a comparable naive quantization. To that end, the default quantization type is kept high quality (Q8_0 to Q5_K) while the FFN_UP and FFN_GATE tensors, along with the FFN_DOWN tensors, are quantized to lower-bit types.
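A back-of-the-envelope sketch of the size argument: the overall BPW is just a parameter-weighted average of the per-tensor types. The ~90% expert-parameter fraction below is an illustrative assumption, not a measured figure for GLM-4.7.

```python
# Sketch: overall bits-per-weight as a weighted average of tensor groups.
# The expert fraction here is an assumption for illustration only.
def overall_bpw(expert_frac: float, expert_bpw: float, default_bpw: float) -> float:
    """Parameter-weighted average BPW across the whole model."""
    return expert_frac * expert_bpw + (1.0 - expert_frac) * default_bpw

# Keeping the default at Q8_0 (~8.5 BPW) while dropping experts to ~4.5 BPW
# pulls the whole model close to the expert BPW:
print(round(overall_bpw(0.90, 4.5, 8.5), 2))  # → 4.9
```

This is why a high-precision default costs little: the non-expert tensors barely move the average.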

The mixture naming convention is `[Default Type]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN]`, e.g. `Q8_0-Q4_K-Q4_K-Q5_K`. This means:
- Q8_0 is the default type (attention, shared expert, etc.)
- Q4_K was used for the FFN_UP and FFN_GATE conditional expert tensors
- Q5_K was used for the FFN_DOWN conditional expert tensors
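The convention can be parsed mechanically; `parse_mixture` below is my own illustrative helper, not part of any tool. It relies on quant type names containing underscores rather than dashes, so a plain dash split is unambiguous:

```python
# Minimal sketch (hypothetical helper): split a mixture name into its
# four components. Quant type names use underscores, never dashes, so
# a plain split on "-" is safe.
def parse_mixture(name: str) -> dict:
    default, ffn_up, ffn_gate, ffn_down = name.split("-")
    return {"default": default, "ffn_up": ffn_up,
            "ffn_gate": ffn_gate, "ffn_down": ffn_down}

print(parse_mixture("Q8_0-Q4_K-Q4_K-Q5_K"))
# {'default': 'Q8_0', 'ffn_up': 'Q4_K', 'ffn_gate': 'Q4_K', 'ffn_down': 'Q5_K'}
```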

I've mapped each mix to the standard quant label with the closest BPW I could reasonably discern.
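As a sanity check on that mapping, the BPW figures in the table below are recoverable from file size alone: derive the implied parameter count from one row, then back out the BPW of the others.

```python
# Sketch: verify the table's BPW figures from the GiB sizes. The
# parameter count is derived from the Q8_0 row (354.79 GiB at 8.50 BPW)
# rather than assumed.
GIB = 1024**3

def bpw(size_gib: float, n_params: float) -> float:
    """Effective bits per weight for a given file size."""
    return size_gib * GIB * 8 / n_params

# Parameter count implied by the Q8_0 row:
n_params = 354.79 * GIB * 8 / 8.50

print(round(bpw(209.77, n_params), 2))  # Q4_K_M row → 5.03
```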

| Quant  | Size                  | Mixture                      | PPL                 | KLD                 |
| :----- | :-------------------- | :--------------------------- | :------------------ | :------------------ |
| Q8_0   | 354.79 GiB (8.50 BPW) | Q8_0                         | 8.6821 ± 0.15706    | 0                   |
| Q5_K_M | 250.15 GiB (6.00 BPW) | Q8_0-Q5_K-Q5_K-Q6_K          | 8.682378 ± 0.157101 | 0.011578 ± 0.000687 |
| Q4_K_M | 209.77 GiB (5.03 BPW) | Q8_0-Q4_K-Q4_K-Q5_K          | 8.746787 ± 0.158456 | 0.017262 ± 0.000585 |
| IQ4_XS | 165.28 GiB (3.96 BPW) | Q8_0-IQ3_S-IQ3_S-IQ4_XS      | 8.866443 ± 0.160719 | 0.043752 ± 0.001071 |
| IQ2_M  | 107.12 GiB (2.57 BPW) | Q5_K-IQ2_XXS-IQ2_XXS-IQ3_XXS | 9.824880 ± 0.179312 | 0.194644 ± 0.003154 |

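To read the table as a size/quality trade-off, it helps to normalize everything against the Q8_0 baseline; this quick sketch uses only the numbers from the table above:

```python
# Sketch: size and perplexity relative to the Q8_0 baseline, computed
# from the figures in the table above.
sizes_gib = {"Q8_0": 354.79, "Q5_K_M": 250.15, "Q4_K_M": 209.77,
             "IQ4_XS": 165.28, "IQ2_M": 107.12}
ppl = {"Q8_0": 8.6821, "Q5_K_M": 8.682378, "Q4_K_M": 8.746787,
       "IQ4_XS": 8.866443, "IQ2_M": 9.824880}

for name, size in sizes_gib.items():
    rel_size = size / sizes_gib["Q8_0"]
    ppl_ratio = ppl[name] / ppl["Q8_0"]
    print(f"{name:7s} {rel_size:6.1%} of Q8_0 size, PPL ratio {ppl_ratio:.4f}")
```

For example, Q4_K_M lands at roughly 59% of the Q8_0 size with a PPL ratio under 1.01, which is the trade-off these mixes are aiming for.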
![ppl_ratio_vs_kld](ppl_ratio_vs_kld.png "Chart showing PPL vs KLD analysis of quants")