
Nemotron-3-Nano-30B-A3B GLQ 4-bit

NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 quantized to 4 bits per weight using GLQ.

This is a hybrid Mamba-Attention-MoE architecture (30B total, ~3B active parameters).

Usage

pip install "glq>=0.2.7" mamba-ssm causal-conv1d
import glq.hf_integration  # side-effect import: enables loading GLQ-quantized checkpoints via transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-4bpw",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-4bpw",
    trust_remote_code=True,
)

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Quantization Details

  • Method: GLQ (E8 lattice codebook + RHT + LDLQ error feedback)
  • Bits per weight: 4 (uniform)
  • Calibration: 128 samples from WikiText-2, sequence length 2048
  • Quantized sublayers: 6,004 (128 MoE experts per layer + attention + MLP)
  • Average SQNR: 21.64 dB
  • Model size: 27.7 GB (vs ~60 GB at bf16)
  • VRAM: 27,677 MB on NVIDIA L40S
  • Quantization time: 41 minutes on L40S with --streaming
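The average SQNR figure above can be interpreted with the standard signal-to-quantization-noise-ratio formula, 10·log10(signal power / quantization-noise power). Below is a minimal, self-contained sketch of that metric applied to naive 4-bit uniform quantization of a random weight matrix. This is illustrative only: it is not the GLQ pipeline (no E8 lattice, RHT, or LDLQ), and `sqnr_db` is a hypothetical helper, not part of the `glq` package.

```python
import numpy as np

def sqnr_db(original, quantized):
    # SQNR = 10 * log10(signal power / quantization-noise power)
    signal = np.sum(original.astype(np.float64) ** 2)
    noise = np.sum((original.astype(np.float64) - quantized.astype(np.float64)) ** 2)
    return 10.0 * np.log10(signal / noise)

# Toy example: naive 4-bit uniform (min-max) quantization of a Gaussian weight matrix
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
levels = 2 ** 4  # 4 bits per weight -> 16 levels
scale = (w.max() - w.min()) / (levels - 1)
w_q = np.round((w - w.min()) / scale) * scale + w.min()
print(f"SQNR: {sqnr_db(w, w_q):.2f} dB")
```

A lattice codebook with error feedback, as used here, trades extra quantization-time compute for a higher SQNR than this naive scalar scheme achieves at the same bit width.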

Requirements

  • glq>=0.2.7
  • mamba-ssm and causal-conv1d (for Mamba layers)
  • trust_remote_code=True (custom architecture)
  • CUDA GPU
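Because the Mamba kernels and the custom architecture fail at load time when a dependency is absent, it can help to check the environment first. A small sketch, assuming only the standard library (`missing_packages` is a hypothetical helper; note the import names use underscores, unlike the pip package names):

```python
import importlib.util

def missing_packages(packages):
    """Return the subset of module names that are not importable."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# Module names required by this checkpoint (import names, not pip names)
required = ["glq", "mamba_ssm", "causal_conv1d", "transformers"]
print("missing:", missing_packages(required))
```

Run this before `from_pretrained`; an empty list means all required modules resolve.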

License

NVIDIA Open Model License (same as base model). See license.
