---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-32B-Instruct
tags:
- bitnet
- quantization
- ternary
- 1.58-bit
- qwen
- qwen2.5
- code
- experimental
- 32b-architecture
library_name: safetensors
pipeline_tag: text-generation
language:
- en
- zh
model_name: Qwen2.5-Coder-32B-BitNet-1.58b
datasets: []
metrics: []
---

# Qwen2.5-Coder-32B-Instruct-BitNet-1.58b

**Architecture: 32 Billion Parameters** | BitNet 1.58-bit Ternary Quantization

---

> **IMPORTANT: Parameter Count Display**
>
> HuggingFace displays "9B params" because it counts packed bytes, not actual parameters.
> This model has the **full 32B parameter Qwen2.5-Coder architecture**.
> The weights are stored as ternary values ({-1, 0, +1}) packed 4 per byte, which reduces
> storage to 9.6 GB but preserves all 32 billion parameters.

---

## Overview

This is an **experimental** BitNet 1.58-bit quantization of the Qwen2.5-Coder-32B-Instruct model using absmean scaling with group-wise quantization. The model stores weights as ternary values ({-1, 0, +1}) packed 4 values per byte.

**This is research/experimental work. Quality and performance have not been formally benchmarked.**

## Specifications

| Property | Value |
|----------|-------|
| Base Model | [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) |
| Architecture | Qwen2 (Qwen2ForCausalLM) |
| Parameters | 32B (full architecture preserved) |
| Quantization | BitNet 1.58-bit ternary |
| Bits per Weight | ~1.58 (log2(3) information content; 2 bits as packed, excluding scales) |
| Group Size | 64 |
| Original Size | 65.53 GB (BF16) |
| Quantized Size | 9.6 GB (SafeTensors) |
| GGUF Size | 11 GB (TQ2_0) |
| Compression | ~6.4x |

## Formats

| Format | File | Description |
|--------|------|-------------|
| SafeTensors | `model-*.safetensors` | Sharded quantized weights + scales |
| GGUF | `qwen2.5-coder-32b-TQ2_0.gguf` | llama.cpp TQ2_0 format (experimental) |

> **GGUF Compatibility Note**: The GGUF conversion is experimental.
> Our BitNet quantization uses group size 64, while TQ2_0 uses 256-element blocks. This may cause compatibility issues with some inference engines. The SafeTensors format is the primary supported format.

## Quantization Method

### Algorithm

1. Reshape weights into groups of 64
2. Compute per-group scale: `scale = mean(|weights|)`
3. Normalize and round to the nearest ternary value: `q = clamp(round(w / scale), -1, +1)`
4. Map to unsigned: {-1, 0, +1} → {0, 1, 2}
5. Pack 4 values per byte: `v0 + v1*3 + v2*9 + v3*27`

### Tooling

- **Quantization**: Custom Rust tool using [Candle](https://github.com/huggingface/candle)
- **GGUF Conversion**: [llama.cpp](https://github.com/ggerganov/llama.cpp) `convert_hf_to_gguf.py`

### Hardware Used

- GPU: NVIDIA RTX 5080 (16 GB VRAM)
- Quantization time: ~369 seconds (streaming mode)
- Memory: Streaming mode with CPU fallback for large tensors (>3 GB threshold)

## Usage

### With Ollama/llama.cpp (experimental)

```bash
# llama.cpp (GGUF format - experimental, may have issues)
./llama-cli -m qwen2.5-coder-32b-TQ2_0.gguf -p "Write a Python function:"
```

### Unpacking Weights (Python)

```python
def unpack_ternary(packed_byte: int) -> list[int]:
    """Unpack 4 ternary values (v0 first) from one base-3 packed byte."""
    values = []
    val = packed_byte
    for _ in range(4):
        values.append((val % 3) - 1)  # {0, 1, 2} → {-1, 0, +1}
        val //= 3  # shift to the next base-3 digit
    return values
```

## Limitations

- **Quality not benchmarked** - May have significant degradation vs original
- **Requires custom runtime** - Standard transformers doesn't support ternary weights
- **Experimental** - Not intended for production use without evaluation
- GGUF keeps embeddings/lm_head at F16, hence larger than SafeTensors
- HuggingFace may show incorrect param count due to packed storage

## License

Apache 2.0 (inherited from [Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct))

## Citation

```bibtex
@misc{qwen-coder-32b-bitnet-2025,
  title={Qwen2.5-Coder-32B-BitNet-1.58b: Experimental BitNet Quantization},
  author={Tzervas},
  year={2025},
  url={https://huggingface.co/tzervas/qwen2.5-coder-32b-bitnet-1.58b}
}
```
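
## Example: Quantize and Pack Round Trip

The five steps listed under Algorithm, together with the unpacking routine, can be sketched in plain Python as a round trip over a single group. This is a minimal illustration, not the Rust/Candle tool that produced these weights. The function names, the handling of an all-zero group, and the tie-breaking of Python's `round()` (halves go to the even integer) are assumptions, and the on-disk tensor layout and scale dtype are not covered here.

```python
from typing import List, Tuple

GROUP_SIZE = 64  # group size from the Specifications table


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize one group to ternary values with an absmean scale (steps 2-3)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        # Assumption: an all-zero group quantizes to all zeros.
        return [0] * len(weights), 0.0
    # round() breaks ties toward even; the original tool's tie rule is not documented.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def pack_ternary(values: List[int]) -> bytes:
    """Pack ternary values 4 per byte as v0 + v1*3 + v2*9 + v3*27 (steps 4-5)."""
    if len(values) % 4 != 0:
        raise ValueError("number of values must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(values), 4):
        # Map {-1, 0, +1} to {0, 1, 2} before base-3 packing.
        v0, v1, v2, v3 = (v + 1 for v in values[i:i + 4])
        out.append(v0 + v1 * 3 + v2 * 9 + v3 * 27)  # at most 80, fits in one byte
    return bytes(out)


def unpack_ternary_bytes(packed: bytes) -> List[int]:
    """Inverse of pack_ternary: recover the ternary values in original order."""
    values = []
    for byte in packed:
        for _ in range(4):
            values.append(byte % 3 - 1)
            byte //= 3
    return values


def dequantize_group(ternary: List[int], scale: float) -> List[float]:
    """Approximate the original weights as ternary value times group scale."""
    return [q * scale for q in ternary]


# Hypothetical group of 64 weights, used only to exercise the round trip.
group = [0.9, -0.05, -1.2, 0.3] * (GROUP_SIZE // 4)
ternary, scale = quantize_group(group)
packed = pack_ternary(ternary)

assert len(packed) == GROUP_SIZE // 4          # 64 weights -> 16 bytes
assert unpack_ternary_bytes(packed) == ternary  # packing is lossless
approx = dequantize_group(ternary, scale)       # lossy reconstruction
```

Packing four base-3 digits into a byte uses only 81 of the 256 possible byte values, which is why the packed form costs 2 bits per weight even though a ternary value carries about 1.58 bits of information.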