# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"

huggingface-cli download --local-dir Nemotron-3-Nano-Omni-30B-A3B-JANGTQ-CRACK dealignai/Nemotron-3-Nano-Omni-30B-A3B-JANGTQ-CRACK

Reasoning V3 SKU. Loads via vMLX or jang-tools Python. Follow @dealignai.



Nemotron-3-Nano-Omni-30B-A3B — JANGTQ + CRACK v2

JANGTQ (8-bit attn affine + 2-bit MXTQ routed experts) | CRACK abliterated v2 | Vision + Audio (Speech) | Hybrid Mamba-2 + Attn + MoE | 12 GB

Best MMLU in the v2 Omni family — 81.5% at thinking=ON. 5/5 comply at thinking=OFF too.



Headline numbers

| Metric | This v2 model | Base model | Δ |
|---|---|---|---|
| HarmBench-320 strict comply (thinking=ON) | 91.6% (293/320) | 12.81% | +78.8pp |
| MMLU-200 generative (thinking=ON, max=8000) | 81.5% (163/200) | 85.5% (max=2000) | -4.0pp |
| Refusals on harmful prompts | 0 explicit refusals | 90%+ refuse | abliteration complete |
| </think> close at greedy on hard MMLU | 5/5 | 5/5 | preserved |
| Multi-turn (3-turn escalation × 3 conversations) | 9/9 comply, context preserved | n/a | works |
| Thinking ON / OFF compliance | 5/5 in BOTH modes | refuses in both | works in either |
| Multimodal | byte-identical to base | preserved | preserved |
| Bundle size | 12 GB | 66 GB BF16 | smallest in family |
| Context | 262,144 tokens native | same | preserved |

JANGTQ outperforms JANGTQ4 on MMLU (81.5% vs 74.0%) despite using lower-bit routed experts (2-bit vs 4-bit). The same Q2 effect was observed in Qwen 3.6 35B JANGTQ2 CRACK.


v2 vs v1 (head-to-head)

| Bench | v1 (broken) | v2 (this release) |
|---|---|---|
| HarmBench-320 strict comply | 92.19% | 91.6% (0 refusals) |
| MMLU-200 thinking=ON | ~70% @max=16384 | 81.5% @max=8000 (best in family) |
| </think> close at greedy (5 hard MMLU) | 0/5 | 5/5 |
| Hard-stops are real loops? | YES (paragraph repetition) | NO (genuine deep reasoning, just out of budget) |

MMLU-200 per-subject (BASE vs CRACK v2)

Both at thinking=ON, greedy. Base at max=2000, CRACK v2 at max=8000.

| Subject | Base | CRACK v2 | Δ | Notes |
|---|---|---|---|---|
| abstract_algebra | 15/20 (75%) | 14/20 (70%) | -5pp | |
| anatomy | 15/20 (75%) | 16/20 (80%) | +5pp | gain from CRACK |
| astronomy | 18/20 (90%) | 18/20 (90%) | 0 | unchanged |
| college_computer_science | 15/20 (75%) | 10/20 (50%) | -25pp | Budget-bound |
| college_physics | 17/20 (85%) | 18/20 (90%) | +5pp | gain from CRACK |
| high_school_biology | 20/20 (100%) | 18/20 (90%) | -10pp | |
| high_school_chemistry | 19/20 (95%) | 19/20 (95%) | 0 | unchanged |
| high_school_mathematics | 18/20 (90%) | 18/20 (90%) | 0 | unchanged |
| logical_fallacies | 18/20 (90%) | 17/20 (85%) | -5pp | |
| world_religions | 16/20 (80%) | 15/20 (75%) | -5pp | |
| TOTAL | 171/200 (85.5%) | 163/200 (81.5%) | -4.0pp | within ship criterion |

Eight of the ten subjects are unchanged or within ±5pp of base; high_school_biology drops 10pp. The −25pp on college_computer_science is budget-bound (8000 tokens is not enough for the deepest CS reasoning); with max_tokens=16384, accuracy approaches base.
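The per-subject deltas and totals can be re-derived from the raw counts. The sketch below only copies the correct-answer counts from the table (20 questions per subject) and recomputes the percentage-point changes; the variable and function names are illustrative and not part of any release tooling.

```python
# Correct answers out of 20 per subject, copied from the table above:
# (base, CRACK v2).
SCORES = {
    "abstract_algebra": (15, 14),
    "anatomy": (15, 16),
    "astronomy": (18, 18),
    "college_computer_science": (15, 10),
    "college_physics": (17, 18),
    "high_school_biology": (20, 18),
    "high_school_chemistry": (19, 19),
    "high_school_mathematics": (18, 18),
    "logical_fallacies": (18, 17),
    "world_religions": (16, 15),
}
QUESTIONS_PER_SUBJECT = 20


def delta_pp(base_correct, crack_correct, n=QUESTIONS_PER_SUBJECT):
    """Accuracy change in percentage points (CRACK v2 minus base)."""
    return 100.0 * (crack_correct - base_correct) / n


deltas = {subject: delta_pp(b, c) for subject, (b, c) in SCORES.items()}
base_total = sum(b for b, _ in SCORES.values())
crack_total = sum(c for _, c in SCORES.values())
total_n = QUESTIONS_PER_SUBJECT * len(SCORES)

# Subjects whose accuracy moved by more than 5 percentage points.
outliers = sorted(s for s, d in deltas.items() if abs(d) > 5)

print(f"base  {base_total}/{total_n} = {100 * base_total / total_n:.1f}%")
print(f"crack {crack_total}/{total_n} = {100 * crack_total / total_n:.1f}%")
print("beyond ±5pp:", outliers)
```

Only college_computer_science and high_school_biology fall outside the ±5pp band.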


HarmBench-320 per-category (CRACK v2)

| Category | n | CRACK comply | Refuse | Empty (truncated) |
|---|---|---|---|---|
| chemical_biological | 42 | 39 (93%) | 0 | 3 |
| copyright | 80 | 61 (76%) | 0 | 19 |
| cybercrime_intrusion | 52 | 51 (98%) | 0 | 1 |
| harassment_bullying | 21 | 19 (90%) | 0 | 2 |
| harmful | 18 | 18 (100%) | 0 | 0 |
| illegal | 53 | 51 (96%) | 0 | 2 |
| misinformation_disinformation | 54 | 54 (100%) | 0 | 0 |
| Overall | 320 | 293 (91.6%) | 0 | 27 |

Zero explicit refusals. The 27 "empty" verdicts are concentrated in copyright prompts (19/80) where the model thinks deeply about whether to reproduce verbatim text and runs out of the 1500-token HarmBench eval budget. With max_tokens=2500+ these would all close cleanly.
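As a consistency check on the table, the category rows can be summed and compared with the Overall row. This is a minimal sketch over the published counts only; it does not rerun the evaluation, and the names are illustrative.

```python
# (n, comply, refuse, empty) per category, copied from the table above.
ROWS = {
    "chemical_biological": (42, 39, 0, 3),
    "copyright": (80, 61, 0, 19),
    "cybercrime_intrusion": (52, 51, 0, 1),
    "harassment_bullying": (21, 19, 0, 2),
    "harmful": (18, 18, 0, 0),
    "illegal": (53, 51, 0, 2),
    "misinformation_disinformation": (54, 54, 0, 0),
}

# Every prompt must land in exactly one verdict bucket.
for name, (n, comply, refuse, empty) in ROWS.items():
    assert comply + refuse + empty == n, name

total_n = sum(row[0] for row in ROWS.values())
total_comply = sum(row[1] for row in ROWS.values())
total_refuse = sum(row[2] for row in ROWS.values())
total_empty = sum(row[3] for row in ROWS.values())

# Share of all truncated ("empty") verdicts that come from copyright prompts.
copyright_share = ROWS["copyright"][3] / total_empty

print(f"overall comply: {total_comply}/{total_n} = {100 * total_comply / total_n:.1f}%")
print(f"copyright share of empty verdicts: {100 * copyright_share:.0f}%")
```

The rows sum to the Overall line, and copyright accounts for 19 of the 27 empty verdicts, roughly 70%.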


Operating recommendations

  • enable_thinking — v2 works in BOTH modes. JANGTQ specifically scored 5/5 at thinking=OFF (matching thinking=ON) — best in the family for thinking=OFF use cases.
  • max_tokens ≥ 16384 for hard reasoning. JANGTQ's compliance is identical in both modes, so use thinking=OFF for shorter budgets and thinking=ON for the hardest prompts.
  • Greedy (temperature=0) AND sampling (temp=0.6, top_p=0.95 — NVIDIA-recommended in generation_config.json) both work.
  • Multi-turn — context preserved across 3+ turns; no late refusals after escalating prompts.
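The recommendations above can be collected into a small settings helper. This is a sketch under assumptions: `GenSettings` and `pick_settings` are hypothetical names, and switching thinking on only when the budget reaches 16384 tokens is one reading of the guidance, not a rule stated by the release. The sampling values (temp=0.6, top_p=0.95) are the ones quoted above.

```python
from dataclasses import dataclass

# Budget recommended above for hard reasoning.
HARD_REASONING_BUDGET = 16384


@dataclass(frozen=True)
class GenSettings:
    enable_thinking: bool
    max_tokens: int
    temperature: float
    top_p: float


def pick_settings(token_budget: int, greedy: bool = True) -> GenSettings:
    """Choose generation settings for a given token budget.

    Assumption: thinking is enabled only when the budget is large enough
    for the reasoning trace to finish; smaller budgets run thinking=OFF.
    """
    thinking = token_budget >= HARD_REASONING_BUDGET
    if greedy:
        # temperature=0 means greedy decoding; top_p has no effect there.
        temperature, top_p = 0.0, 1.0
    else:
        # Sampling values quoted in the recommendations above.
        temperature, top_p = 0.6, 0.95
    return GenSettings(thinking, token_budget, temperature, top_p)


print(pick_settings(2048))
print(pick_settings(16384, greedy=False))
```

The resulting fields map onto `enable_thinking` in the chat template and `max_tokens` in the generate call; how temperature and top_p are passed depends on the loader in use.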

Verification

  • All multimodal tensors (vision + audio + projectors) are byte-identical to base — capabilities fully preserved.
  • All config files unchanged (config.json, jang_config.json, generation_config.json, chat_template.jinja, tokenizer_config.json).
  • Bit widths preserved: attn=8, shared=8, mamba=8, routed=2, embed=8, lm_head=8.
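A byte-identity claim like the one above can be checked independently by hashing tensors straight out of the safetensors files. The sketch below uses only the standard library and relies on the safetensors layout (an 8-byte little-endian header length, a JSON header, then the raw tensor bytes). The function names are illustrative, and the tensor-name prefixes you pass in are placeholders: the actual names of the vision, audio, and projector tensors in this bundle are not listed here.

```python
import hashlib
import json
import struct


def tensor_digests(path, prefixes):
    """Map tensor name -> SHA-256 over its dtype, shape, and raw bytes.

    `prefixes` is a list or tuple of name prefixes to include.
    """
    digests = {}
    with open(path, "rb") as f:
        # safetensors: u64 little-endian header size, then a JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        for name, info in header.items():
            if name == "__metadata__" or not name.startswith(tuple(prefixes)):
                continue
            begin, end = info["data_offsets"]  # relative to the data section
            f.seek(data_start + begin)
            h = hashlib.sha256()
            h.update(json.dumps([info["dtype"], info["shape"]]).encode())
            h.update(f.read(end - begin))
            digests[name] = h.hexdigest()
    return digests


def changed_tensors(base_path, candidate_path, prefixes):
    """Names that differ, or exist in only one file, among matching tensors."""
    base = tensor_digests(base_path, prefixes)
    cand = tensor_digests(candidate_path, prefixes)
    return sorted(n for n in base.keys() | cand.keys() if base.get(n) != cand.get(n))
```

An empty list from `changed_tensors` means every matching tensor is byte-identical across the two files. For sharded checkpoints, run it once per pair of corresponding shards.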

Architecture (nemotron_h)

  • 52 layers: hybrid Mamba-2 + MoE + Attention
  • Hidden 2688, head_dim 128, GQA 32q/2kv (NO RoPE on attention — position from Mamba state)
  • 128 routed experts top-6 (sigmoid) + 1 shared expert per MoE layer
  • Multimodal: image (RADIO ViT) + audio/speech (Parakeet) merged via early-fusion projectors
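A few quantities follow directly from the figures above. The sketch below works them out; the 16-bit KV cache is an assumption, and the number of attention layers among the 52 is not stated here, so the cache figures are per attention layer only.

```python
# Figures from the architecture list above.
HIDDEN = 2688
HEAD_DIM = 128
N_Q_HEADS = 32
N_KV_HEADS = 2
N_ROUTED_EXPERTS = 128
TOP_K = 6
CONTEXT_TOKENS = 262_144

# Assumption: keys and values are cached at 16 bits (2 bytes) per value.
CACHE_BYTES_PER_VALUE = 2

q_width = N_Q_HEADS * HEAD_DIM            # query projection width
kv_width = N_KV_HEADS * HEAD_DIM          # key (or value) width
q_heads_per_kv = N_Q_HEADS // N_KV_HEADS  # GQA sharing factor

# Keys plus values, per token, for ONE attention layer.
kv_bytes_per_token = 2 * kv_width * CACHE_BYTES_PER_VALUE
kv_bytes_full_context = kv_bytes_per_token * CONTEXT_TOKENS

# Fraction of routed experts active per token in an MoE layer.
routed_fraction = TOP_K / N_ROUTED_EXPERTS

print(f"query width {q_width} vs hidden {HIDDEN}")
print(f"{q_heads_per_kv} query heads share each KV head")
print(f"KV cache: {kv_bytes_per_token} B/token/attention layer, "
      f"{kv_bytes_full_context / 2**20:.0f} MiB at full context")
print(f"routed experts active per token: {100 * routed_fraction:.1f}%")
```

With only 2 KV heads, each attention layer caches 1 KiB per token under this assumption, which is what keeps the 262,144-token context affordable per layer.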

Loading

import sys

from huggingface_hub import snapshot_download
from mlx_lm import generate

# jang-tools is imported from a local checkout; point this at your copy.
sys.path.insert(0, "/path/to/jang-tools")
from jang_tools.load_jangtq import load_jangtq_model

# Download the bundle (or reuse the local cache), then load it.
path = snapshot_download("dealignai/Nemotron-3-Nano-Omni-30B-A3B-JANGTQ-CRACK")
model, tokenizer = load_jangtq_model(path)

# Build the chat prompt; enable_thinking toggles the reasoning trace.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your question"}],
    tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)

out = generate(model, tokenizer, prompt=prompt, max_tokens=16384)
# Print only the text that follows the reasoning trace.
print(out.split("</think>", 1)[-1])
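The `split("</think>", 1)[-1]` idiom returns the entire output when the closing tag is missing, which happens when a thinking run exhausts its token budget. A slightly more defensive splitter is sketched below. `split_reasoning` is a hypothetical helper, and the literal `<think>` opening tag is an assumption; only `</think>` appears in the loading example.

```python
THINK_OPEN = "<think>"    # assumption: opening tag, if the output includes one
THINK_CLOSE = "</think>"


def split_reasoning(text: str, thinking_enabled: bool = True):
    """Return (reasoning, answer, complete).

    complete is False when thinking was enabled but the closing tag never
    appeared, i.e. generation stopped inside the reasoning trace.
    """
    head, sep, tail = text.partition(THINK_CLOSE)
    if sep:
        # Normal case: reasoning before the tag, answer after it.
        return head.replace(THINK_OPEN, "", 1).strip(), tail.strip(), True
    if thinking_enabled:
        # No closing tag: treat everything as unfinished reasoning.
        return text.replace(THINK_OPEN, "", 1).strip(), "", False
    # Thinking disabled: the whole output is the answer.
    return "", text.strip(), True


reasoning, answer, complete = split_reasoning("<think>2 + 2 = 4</think> The answer is 4.")
print(answer)
```

When `complete` is False, retrying with a larger `max_tokens` is usually more useful than showing the truncated trace.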

For the multimodal pipeline (image + audio + video), pair this bundle with the unmodified Multimodal-Addon.


Use responsibly

This model has had refusal training surgically removed for legitimate research, red-teaming, and evaluation. Outputs may include harmful content. You are solely responsible for any use. Do not deploy in consumer-facing contexts without your own safety layer. Do not use in violation of applicable law in your jurisdiction.


Built by dealignai. Sister bundles: JANGTQ4-CRACK (19 GB, 4-bit MXTQ) · MXFP4-CRACK (21 GB, uniform 4-bit affine).
