---
license: mit
base_model:
- zai-org/GLM-4-32B-0414
datasets:
- mit-han-lab/pile-val-backup
pipeline_tag: text-generation
tags:
- gptq
- vllm
- llmcompressor
- text-generation-inference
---

# GLM-4-32B-0414 Quantized with GPTQ (4-bit weight-only, W4A16)

This repo contains GLM-4-32B-0414 quantized to 4-bit with asymmetric GPTQ to make it suitable for consumer hardware. The model was calibrated on 2048 samples of maximum sequence length 4096 from the dataset [`mit-han-lab/pile-val-backup`](https://huggingface.co/datasets/mit-han-lab/pile-val-backup).

This is my very first quantized model; I welcome suggestions.

2048 samples at sequence length 4096 were chosen over the defaults of 512/2048 to reduce the risk of overfitting to the calibration set and to help GPTQ converge. They also happen to fit in my GPU.

Original model:
- [zai-org/GLM-4-32B-0414](https://huggingface.co/zai-org/GLM-4-32B-0414)

## 📥 Usage & Running Instructions

The model was tested with vLLM. Here is a launch script suitable for 32 GB VRAM GPUs (a client-side smoke test is sketched at the end of this card):

```bash
export MODEL="mratsim/GLM-4-32B-0414.w4a16-gptq"
vllm serve "${MODEL}" \
  --served-model-name glm-4-32b \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-model-len 130000 \
  --max-num-seqs 256 \
  --generation-config "${MODEL}" \
  --enable-auto-tool-choice --tool-call-parser pythonic \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
```

## 🔬 Quantization method

The llmcompressor library was used with the following recipe for asymmetric GPTQ:

```yaml
default_stage:
  default_modifiers:
    GPTQModifier:
      dampening_frac: 0.005
      config_groups:
        group_0:
          targets: [Linear]
          weights: {num_bits: 4, type: int, symmetric: false, group_size: 128,
            strategy: group, dynamic: false, observer: minmax}
      ignore: [lm_head]
```

and calibrated on 2048 samples of sequence length 4096 from [`mit-han-lab/pile-val-backup`](https://huggingface.co/datasets/mit-han-lab/pile-val-backup).
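For reference, here is a minimal sketch of what the full quantization run looks like in Python. It mirrors the YAML recipe above, but the import paths and `oneshot` arguments follow recent llmcompressor releases and may differ in older versions, and the shuffle seed is my own assumption rather than a recorded detail of the original run:

```python
# Sketch of the GPTQ quantization run with llmcompressor.
# Assumptions: recent llmcompressor API (older versions expose
# `oneshot` under `llmcompressor.transformers`), shuffle seed 42.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "zai-org/GLM-4-32B-0414"
SAVE_DIR = "GLM-4-32B-0414.w4a16-gptq"
NUM_SAMPLES = 2048  # calibration samples used for this repo
MAX_SEQ_LEN = 4096  # max calibration sequence length

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Same settings as the YAML recipe above: 4-bit asymmetric
# weight-only GPTQ, group size 128, lm_head left in full precision.
recipe = GPTQModifier(
    dampening_frac=0.005,
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": False,
                "group_size": 128,
                "strategy": "group",
                "dynamic": False,
                "observer": "minmax",
            },
        }
    },
    ignore=["lm_head"],
)

# pile-val-backup only ships a "validation" split.
ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")
ds = ds.shuffle(seed=42).select(range(NUM_SAMPLES))

# Pre-tokenize the raw text, as in the llmcompressor examples.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        truncation=True,
        max_length=MAX_SEQ_LEN,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

# Save in compressed-tensors format so vLLM can load it directly.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```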
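Finally, a quick client-side smoke test for the vLLM server started in the usage section above. It assumes the default `localhost:8000` endpoint and no API key; the prompt is arbitrary, and the model name matches the `--served-model-name` flag from the launch script:

```python
# Smoke test against vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4-32b",  # matches --served-model-name above
    messages=[
        {"role": "user", "content": "Explain GPTQ quantization in one sentence."}
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```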