# Nemotron-3-Nano-30B-A3B GLQ 4-bit

NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 quantized to 4 bits per weight using GLQ. The base model is a hybrid Mamba-Attention-MoE architecture (30B total parameters, ~3B active per token).
## Usage
```bash
pip install "glq>=0.2.7" mamba-ssm causal-conv1d
```
```python
import glq.hf_integration  # registers GLQ's quantized modules with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-4bpw",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-4bpw",
    trust_remote_code=True,
)

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Quantization Details
- Method: GLQ (E8 lattice codebook + randomized Hadamard transform (RHT) + LDLQ error feedback)
- Bits per weight: 4 (uniform)
- Calibration: 128 samples from WikiText-2, sequence length 2048
- Quantized sublayers: 6,004 (128 MoE experts per layer + attention + MLP)
- Average SQNR: 21.64 dB
- Model size: 27.7 GB (vs ~60 GB at bf16)
- VRAM: 27,677 MB on NVIDIA L40S
- Quantization time: 41 minutes on L40S with `--streaming`
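The SQNR figure above compares each original weight matrix to its dequantized counterpart. A minimal sketch of that metric, with a toy 4-bit uniform quantizer standing in for GLQ's lattice codebook (the `sqnr_db` helper and the max-abs scaling scheme are illustrative assumptions, not GLQ's actual implementation):

```python
import numpy as np

def sqnr_db(original: np.ndarray, quantized: np.ndarray) -> float:
    """Signal-to-quantization-noise ratio in decibels."""
    o = original.astype(np.float64)
    q = quantized.astype(np.float64)
    return 10.0 * np.log10(np.sum(o ** 2) / np.sum((o - q) ** 2))

# Toy illustration: round a random weight matrix to a 4-bit signed grid.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
scale = np.abs(w).max() / 7          # 4-bit signed integer range: [-8, 7]
w_q = np.clip(np.round(w / scale), -8, 7) * scale
print(f"SQNR: {sqnr_db(w, w_q):.2f} dB")
```

A plain uniform quantizer like this lands well below the 21.64 dB reported here; the gap is roughly what the E8 lattice codebook, RHT, and LDLQ error feedback buy.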
## Requirements
- `glq>=0.2.7`
- `mamba-ssm` and `causal-conv1d` (for the Mamba layers)
- `trust_remote_code=True` (custom architecture)
- CUDA GPU
## License
NVIDIA Open Model License (same as base model). See license.
## Base model

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16