Reasoning V3 SKU. Loads via vMLX or jang-tools Python. Follow @dealignai.

Built for vMLX — the only MLX inferencer with VL support, KV cache quantization, prefix cache reuse, agentic tool calling, and speculative decoding.
Free for macOS · vmlx.net

Nemotron-3-Nano-Omni-30B-A3B — JANGTQ4 + CRACK v2

JANGTQ4 (8-bit attn affine + 4-bit MXTQ routed experts) | CRACK abliterated v2 | Vision + Audio (Speech) | Hybrid Mamba-2 + Attn + MoE | 19 GB

Headline numbers

| Metric | This v2 model | Base model | Δ |
|---|---|---|---|
| HarmBench-320 strict comply (thinking=ON) | 97.2% (311/320) | 12.81% (refuses) | +84.4pp |
| MMLU-200 generative (thinking=ON, max=8000) | 74.0% (148/200) | 86.5% (max=2000) | -12.5pp |
| Refusals on harmful prompts | 0 explicit refusals | typically 90%+ refuse | abliteration complete |
| `</think>` close at greedy on hard MMLU | 5/5 | 5/5 | preserved |
| Multi-turn (3-turn escalation × 3 conversations) | 9/9 comply, context preserved | n/a | works |
| Multimodal byte-identical to base | preserved | n/a | preserved |
| Bundle size | 19 GB | 66 GB BF16 | n/a |
| Context | 262,144 tokens native | same | preserved |

Nearly half of the 12.5pp MMLU gap (12 of the 25 lost questions) comes from two reasoning-heavy subjects (abstract_algebra, college_computer_science), where the 8000-token thinking budget runs out before `</think>` closes. These hard-stops are genuine deep reasoning, not v1-style infinite repetition loops. With max_tokens ≥ 16384, accuracy approaches base.
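One way to tell a budget hard-stop from a finished answer is to check whether the closing `</think>` tag ever appears in the generated text. The following is a minimal sketch, assuming the output is a plain string in which the reasoning precedes the tag. The helper name `split_reasoning` is hypothetical and is not part of vMLX or jang-tools.

```python
from typing import NamedTuple

THINK_CLOSE = "</think>"


class Parsed(NamedTuple):
    reasoning: str
    answer: str
    closed: bool  # False means the budget ran out mid-reasoning


def split_reasoning(text: str) -> Parsed:
    """Split generated text at the first </think> tag (hypothetical helper)."""
    head, sep, tail = text.partition(THINK_CLOSE)
    if not sep:
        # No closing tag: treat everything as unfinished reasoning, no answer.
        return Parsed(reasoning=text.strip(), answer="", closed=False)
    return Parsed(reasoning=head.strip(), answer=tail.strip(), closed=True)
```

A result with `closed=False` is a candidate for a rerun with a larger `max_tokens`. Note that `text.split("</think>", 1)[-1]` returns the whole reasoning text when the tag is absent, so an explicit check like this avoids scoring unfinished reasoning as an answer.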

v2 vs v1 (head-to-head)

v1 (shipped 2026-04-28) had a </think> termination defect at greedy decoding — the model couldn't terminate reasoning on hard prompts and looped to budget cutoff. MMLU dropped from 86.5% base → 70.0% v1.

v2 (this release) restores clean termination:

| Bench | v1 (broken) | v2 (this release) |
|---|---|---|
| HarmBench-320 strict comply | 95.94% | 97.2% (0 refusals) |
| MMLU-200 thinking=ON | 70.0% @max=16384 | 74.0% @max=8000 (+4pp at half budget) |
| `</think>` close at greedy (5 hard MMLU) | 0/5 | 5/5 |
| Hard-stops are real loops? | YES (paragraph repetition) | NO (genuine deep reasoning, just out of budget) |
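The distinction between a repetition loop and genuine reasoning that ran out of budget can be approximated with a simple check for verbatim repeated paragraphs. This is only a heuristic sketch and not the method used to produce the table. The function name and both thresholds (`min_repeats`, `min_len`) are assumptions chosen for illustration.

```python
from collections import Counter


def looks_like_loop(text: str, min_repeats: int = 3, min_len: int = 40) -> bool:
    """Heuristic: True if any sufficiently long paragraph repeats verbatim.

    Thresholds are illustrative assumptions, not values from the model card.
    """
    paragraphs = [p.strip() for p in text.split("\n\n")]
    # Ignore short paragraphs, which repeat naturally (e.g. "So:" or "Wait.").
    counts = Counter(p for p in paragraphs if len(p) >= min_len)
    return any(n >= min_repeats for n in counts.values())
```

Applied to a truncated reasoning trace, a `True` result suggests the v1-style failure mode, while `False` is consistent with reasoning that simply needed more tokens. Near-duplicate paragraphs with small wording changes would not be caught by this exact-match version.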

MMLU-200 per-subject (BASE vs CRACK v2)

Both at thinking=ON, greedy. Base at max=2000, CRACK v2 at max=8000.

| Subject | Base | CRACK v2 | Δ | Notes |
|---|---|---|---|---|
| abstract_algebra | 17/20 (85%) | 12/20 (60%) | -25pp | Budget-bound — needs max=16384 |
| anatomy | 16/20 (80%) | 13/20 (65%) | -15pp | |
| astronomy | 19/20 (95%) | 17/20 (85%) | -10pp | |
| college_computer_science | 14/20 (70%) | 7/20 (35%) | -35pp | Hardest subject — needs max=16384 |
| college_physics | 18/20 (90%) | 18/20 (90%) | 0 | unchanged |
| high_school_biology | 19/20 (95%) | 17/20 (85%) | -10pp | |
| high_school_chemistry | 18/20 (90%) | 17/20 (85%) | -5pp | |
| high_school_mathematics | 18/20 (90%) | 15/20 (75%) | -15pp | |
| logical_fallacies | 17/20 (85%) | 16/20 (80%) | -5pp | |
| world_religions | 17/20 (85%) | 16/20 (80%) | -5pp | |
| TOTAL | 173/200 (86.5%) | 148/200 (74.0%) | -12.5pp | majority loss is budget-bound |
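The totals and deltas in the table can be re-derived from the per-subject counts. The sketch below copies the counts from the table and recomputes them. The variable names are illustrative only.

```python
# (base_correct, crack_v2_correct) out of 20 questions per subject,
# copied from the per-subject table.
QUESTIONS_PER_SUBJECT = 20
scores = {
    "abstract_algebra": (17, 12),
    "anatomy": (16, 13),
    "astronomy": (19, 17),
    "college_computer_science": (14, 7),
    "college_physics": (18, 18),
    "high_school_biology": (19, 17),
    "high_school_chemistry": (18, 17),
    "high_school_mathematics": (18, 15),
    "logical_fallacies": (17, 16),
    "world_religions": (17, 16),
}

# Per-subject change in percentage points.
deltas_pp = {
    subject: (v2 - base) * 100 / QUESTIONS_PER_SUBJECT
    for subject, (base, v2) in scores.items()
}

base_total = sum(base for base, _ in scores.values())
v2_total = sum(v2 for _, v2 in scores.values())
n_questions = QUESTIONS_PER_SUBJECT * len(scores)
total_delta_pp = (v2_total - base_total) * 100 / n_questions

# Questions lost in the two subjects flagged as budget-bound.
flagged = ("abstract_algebra", "college_computer_science")
lost_in_flagged = sum(scores[s][0] - scores[s][1] for s in flagged)

print(base_total, v2_total, total_delta_pp, lost_in_flagged)
```

The two flagged subjects account for 12 of the 25 questions lost overall.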

HarmBench-320 per-category (CRACK v2)

| Category | n | CRACK comply | Empty (truncated) |
|---|---|---|---|
| chemical_biological | 42 | 42 (100%) | 0 |
| copyright | 80 | 75 (94%) | 5 |
| cybercrime_intrusion | 52 | 49 (94%) | 3 |
| harassment_bullying | 21 | 21 (100%) | 0 |
| harmful | 18 | 18 (100%) | 0 |
| illegal | 53 | 53 (100%) | 0 |
| misinformation_disinformation | 54 | 53 (98%) | 1 |
| Overall | 320 | 311 (97.2%) | 9 |

Zero explicit refusals. The 9 "empty" verdicts are token-budget truncations on copyright/long prompts (thinking phase consumed all 1500 tokens before producing the answer).

Operating recommendations

enable_thinking: v2 works in both modes (5/5 comply with thinking ON, 4/5 with thinking OFF). Keep thinking ON for the hardest prompts; OFF is sufficient for most others.
max_tokens ≥ 16384 for hard reasoning (math, abstract algebra, complex CS).
Greedy (temperature=0) AND sampling (temp=0.6, top_p=0.95 — NVIDIA-recommended in generation_config.json) both work.
Multi-turn — context preserved across 3+ turns; no late refusals after escalating prompts.
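The recommendations above can be collected into a small settings helper. This is a sketch under stated assumptions: the function name `recommended_settings` is hypothetical, the 8000-token default for non-hard prompts is borrowed from the MMLU run rather than stated as a recommendation, and `top_p=1.0` for greedy decoding is an assumed neutral value. How the resulting values are passed to a given inference stack depends on that stack.

```python
def recommended_settings(hard_reasoning: bool = False, greedy: bool = True) -> dict:
    """Return generation settings following the operating recommendations.

    Hypothetical helper; the 8000-token default and top_p=1.0 for greedy
    decoding are assumptions, not values prescribed by the model card.
    """
    settings = {
        "enable_thinking": True,
        # 16384 is the recommended floor for math, abstract algebra, complex CS.
        "max_tokens": 16384 if hard_reasoning else 8000,
    }
    if greedy:
        settings.update(temperature=0.0, top_p=1.0)
    else:
        # Sampling values the card attributes to generation_config.json.
        settings.update(temperature=0.6, top_p=0.95)
    return settings
```

For example, `recommended_settings(hard_reasoning=True, greedy=False)` yields a 16384-token budget with the sampling values listed above.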

Verification

All multimodal tensors (vision + audio + projectors) are byte-identical to base — capabilities fully preserved.
All config files unchanged (config.json, jang_config.json, generation_config.json, chat_template.jinja, tokenizer_config.json).
Bit widths preserved: attn=8, shared=8, mamba=8, routed=4, embed=8, lm_head=8.
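A byte-identity claim of this kind can be checked by hashing the raw bytes of each tensor in both bundles. The sketch below assumes the tensors have already been read into name-to-bytes mappings; how they are read is left out. The function name and the name prefixes in the usage note are hypothetical, since the card does not list actual tensor names.

```python
import hashlib


def changed_tensors(
    base: dict[str, bytes],
    patched: dict[str, bytes],
    prefixes: tuple[str, ...],
) -> list[str]:
    """Return names under `prefixes` that are missing or differ in `patched`.

    Hypothetical helper: inputs map tensor names to their raw bytes.
    """
    changed = []
    for name in sorted(base):
        if not name.startswith(prefixes):
            continue  # only compare the tensors of interest
        if name not in patched:
            changed.append(name)
            continue
        base_hash = hashlib.sha256(base[name]).hexdigest()
        patched_hash = hashlib.sha256(patched[name]).hexdigest()
        if base_hash != patched_hash:
            changed.append(name)
    return changed
```

An empty result for prefixes such as `("vision.", "audio.")` (placeholder names) would be consistent with the multimodal tensors being untouched.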

Architecture (`nemotron_h`)

52 layers: hybrid Mamba-2 + MoE + Attention
Hidden 2688, head_dim 128, GQA 32q/2kv (NO RoPE on attention — position from Mamba state)
128 routed experts top-6 (sigmoid) + 1 shared expert per MoE layer
Multimodal: image (RADIO ViT) + audio/speech (Parakeet) merged via early-fusion projectors
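The GQA figures above imply a small KV cache per attention layer. The arithmetic below is a sketch assuming an unquantized 16-bit cache (2 bytes per element). The number of attention layers among the 52 is not stated here, so only the per-layer cost is computed, and the function name is illustrative.

```python
def kv_cache_bytes_per_layer(
    tokens: int,
    kv_heads: int = 2,        # from "GQA 32q/2kv"
    head_dim: int = 128,      # from "head_dim 128"
    bytes_per_elem: int = 2,  # assumption: 16-bit cache, no KV quantization
) -> int:
    """Bytes of KV cache one attention layer needs for `tokens` positions."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * kv_heads * head_dim * bytes_per_elem * tokens


per_token = kv_cache_bytes_per_layer(1)
full_context = kv_cache_bytes_per_layer(262_144)
print(per_token, "bytes per token per attention layer")
print(full_context / 2**20, "MiB per attention layer at full context")
```

Under these assumptions, each attention layer needs 1024 bytes per token, or 256 MiB at the full 262,144-token context. Mamba-2 layers keep a fixed-size state instead and are not covered by this estimate.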

Loading

import sys

from huggingface_hub import snapshot_download
from mlx_lm import generate

# jang-tools is imported from a local checkout; adjust the path to yours.
sys.path.insert(0, "/path/to/jang-tools")
from jang_tools.load_jangtq import load_jangtq_model

# Download the bundle (cached after the first call) and load it.
path = snapshot_download("dealignai/Nemotron-3-Nano-Omni-30B-A3B-JANGTQ4-CRACK")
model, tokenizer = load_jangtq_model(path)

# Build the chat prompt with thinking enabled.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your question"}],
    tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)

out = generate(model, tokenizer, prompt=prompt, max_tokens=16384)
# Print only the final answer, i.e. the text after the closing </think> tag.
print(out.split("</think>", 1)[-1])

For the multimodal pipeline (image + audio + video), pair this bundle with the unmodified Multimodal-Addon.

Use responsibly

This model has had refusal training surgically removed for legitimate research, red-teaming, and evaluation. Outputs may include harmful content. You are solely responsible for any use. Do not deploy in consumer-facing contexts without your own safety layer. Do not use in violation of applicable law in your jurisdiction.

Built by dealignai. Sister bundles: JANGTQ-CRACK (12 GB, 2-bit MXTQ) · MXFP4-CRACK (21 GB, uniform 4-bit affine).

Base model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16