
Nemotron-3-Nano-30B-A3B GLQ 4-bit

NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 quantized to 4 bits per weight using GLQ.

This is a hybrid Mamba-Attention-MoE architecture (30B total, ~3B active parameters).

Usage

pip install "glq>=0.2.7" mamba-ssm causal-conv1d
import glq.hf_integration  # side-effect import: enables loading GLQ-quantized checkpoints via transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-4bpw",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-4bpw",
    trust_remote_code=True,
)

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Quantization Details

  • Method: GLQ (E8 lattice codebook + RHT + LDLQ error feedback)
  • Bits per weight: 4 (uniform)
  • Calibration: 128 samples from WikiText-2, sequence length 2048
  • Quantized sublayers: 6,004 (128 MoE experts per layer + attention + MLP)
  • Average SQNR: 21.64 dB
  • Model size: 27.7 GB (vs ~60 GB at bf16)
  • VRAM: 27,677 MB on NVIDIA L40S
  • Quantization time: 41 minutes on L40S with --streaming
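The average SQNR figure above can be interpreted with the standard signal-to-quantization-noise-ratio formula, 10·log10(signal power / quantization-noise power). Below is a minimal, self-contained sketch of that metric applied to naive 4-bit uniform quantization of a random weight matrix. This is illustrative only: it is not the GLQ pipeline (no E8 lattice, RHT, or LDLQ), and `sqnr_db` is a hypothetical helper, not part of the `glq` package.

```python
import numpy as np

def sqnr_db(original, quantized):
    # SQNR = 10 * log10(signal power / quantization-noise power)
    signal = np.sum(original.astype(np.float64) ** 2)
    noise = np.sum((original.astype(np.float64) - quantized.astype(np.float64)) ** 2)
    return 10.0 * np.log10(signal / noise)

# Toy example: naive 4-bit uniform (min-max) quantization of a Gaussian weight matrix
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
levels = 2 ** 4  # 4 bits per weight -> 16 levels
scale = (w.max() - w.min()) / (levels - 1)
w_q = np.round((w - w.min()) / scale) * scale + w.min()
print(f"SQNR: {sqnr_db(w, w_q):.2f} dB")
```

A lattice codebook with error feedback, as used here, trades extra quantization-time compute for a higher SQNR than this naive scalar scheme achieves at the same bit width.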

Requirements

  • glq>=0.2.7
  • mamba-ssm and causal-conv1d (for Mamba layers)
  • trust_remote_code=True (custom architecture)
  • CUDA GPU
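Because the Mamba kernels and the custom architecture fail at load time when a dependency is absent, it can help to check the environment first. A small sketch, assuming only the standard library (`missing_packages` is a hypothetical helper; note the import names use underscores, unlike the pip package names):

```python
import importlib.util

def missing_packages(packages):
    """Return the subset of module names that are not importable."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# Module names required by this checkpoint (import names, not pip names)
required = ["glq", "mamba_ssm", "causal_conv1d", "transformers"]
print("missing:", missing_packages(required))
```

Run this before `from_pretrained`; an empty list means all required modules resolve.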

License

NVIDIA Open Model License (same as base model). See license.
