Reasoning V3 SKU. Loads via vMLX or the jang-tools Python package. Follow @dealignai.

Built for vMLX — the only MLX inferencer with VL support, KV cache quantization, prefix cache reuse, agentic tool calling, and speculative decoding.
Free for macOS · vmlx.net

Nemotron-3-Nano-Omni-30B-A3B — JANGTQ + CRACK v2

JANGTQ (8-bit attn affine + 2-bit MXTQ routed experts) | CRACK abliterated v2 | Vision + Audio (Speech) | Hybrid Mamba-2 + Attn + MoE | 12 GB

Best MMLU in the v2 Omni family: 81.5% at thinking=ON. Compliance is 5/5 at thinking=OFF as well.

Headline numbers

Metric	This v2 model	Base model	Δ
HarmBench-320 strict comply (thinking=ON)	91.6% (293/320)	12.81%	+78.8pp
MMLU-200 generative (thinking=ON, max=8000)	81.5% (163/200)	85.5% (max=2000)	-4.0pp ✅
Refusals on harmful prompts	0 explicit refusals	90%+ refused	abliteration complete
`</think>` close at greedy on hard MMLU	5/5	5/5	preserved
Multi-turn (3-turn escalation × 3 conversations)	9/9 comply, context preserved	n/a	works
Thinking ON / OFF compliance	5/5 in BOTH modes	refuses in both	works in either
Multimodal byte-identical to base	preserved	—	preserved
Bundle size	12 GB	66 GB BF16	smallest in family
Context	262,144 tokens native	same	preserved
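
The percentages and deltas in the table follow from the raw counts. A quick arithmetic check, using only figures quoted above (the variable names are illustrative):

```python
# Recompute the headline percentages from the raw counts quoted in the table.
harmbench_comply = 293 / 320 * 100   # CRACK v2, thinking=ON
harmbench_base = 12.81               # base model, as reported
mmlu_crack = 163 / 200 * 100         # CRACK v2, thinking=ON
mmlu_base = 85.5                     # base model, as reported

harmbench_delta = harmbench_comply - harmbench_base
mmlu_delta = mmlu_crack - mmlu_base
size_ratio = 66 / 12                 # BF16 bundle vs this bundle, in GB

print(f"HarmBench comply: {harmbench_comply:.1f}% (delta {harmbench_delta:+.1f}pp)")
print(f"MMLU-200: {mmlu_crack:.1f}% (delta {mmlu_delta:+.1f}pp)")
print(f"Size reduction vs BF16: {size_ratio:.1f}x")
```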

JANGTQ outperforms JANGTQ4 on MMLU (81.5% vs 74.0%) despite its lower-bit routed experts (2-bit vs 4-bit), the same Q2 effect observed in Qwen 3.6 35B JANGTQ2 CRACK.

v2 vs v1 (head-to-head)

Bench	v1 (broken)	v2 (this release)
HarmBench-320 strict comply	92.19%	91.6% (0 refusals)
MMLU-200 thinking=ON	~70% @max=16384	81.5% @max=8000 (best in family)
`</think>` close at greedy (5 hard MMLU)	0/5	5/5
Hard-stops are real loops?	YES (paragraph repetition)	NO (genuine deep reasoning, just out of budget)
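
The last row separates outputs that hit the token budget while repeating paragraphs (v1) from outputs that were still reasoning when the budget ran out (v2). One possible heuristic for telling the two apart is sketched below. It is not the method used to produce the table; `classify_hard_stop` and its `repeat_threshold` default are hypothetical.

```python
from collections import Counter

def classify_hard_stop(text: str, repeat_threshold: float = 0.5) -> str:
    """Label a generation as 'closed', 'loop', or 'out_of_budget'.

    Hypothetical heuristic: an output that never emits </think> counts as a
    loop when at least `repeat_threshold` of its paragraphs are duplicates.
    """
    if "</think>" in text:
        return "closed"
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return "out_of_budget"
    counts = Counter(paragraphs)
    # Number of paragraphs whose exact text occurs more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    if repeated / len(paragraphs) >= repeat_threshold:
        return "loop"
    return "out_of_budget"
```

Exact-match comparison is deliberately strict; near-duplicate paragraphs would need a fuzzier similarity measure.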

MMLU-200 per-subject (BASE vs CRACK v2)

Both at thinking=ON, greedy. Base at max=2000, CRACK v2 at max=8000.

Subject	Base	CRACK v2	Δ	Notes
abstract_algebra	15/20 (75%)	14/20 (70%)	-5pp
anatomy	15/20 (75%)	16/20 (80%)	+5pp	gain from CRACK
astronomy	18/20 (90%)	18/20 (90%)	0	unchanged
college_computer_science	15/20 (75%)	10/20 (50%)	-25pp	Budget-bound
college_physics	17/20 (85%)	18/20 (90%)	+5pp	gain from CRACK
high_school_biology	20/20 (100%)	18/20 (90%)	-10pp
high_school_chemistry	19/20 (95%)	19/20 (95%)	0	unchanged
high_school_mathematics	18/20 (90%)	18/20 (90%)	0	unchanged
logical_fallacies	18/20 (90%)	17/20 (85%)	-5pp
world_religions	16/20 (80%)	15/20 (75%)	-5pp
TOTAL	171/200 (85.5%)	163/200 (81.5%)	-4.0pp	within ship criterion

Eight of the ten subjects are unchanged or within ±5pp of base; high_school_biology drops 10pp. The −25pp on college_computer_science is budget-bound (8000 tokens is not enough for the deepest CS reasoning); with max_tokens=16384, accuracy approaches base.
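
The per-subject deltas and totals can be re-derived from the counts in the table. A small sketch, with the rows copied from above (20 questions per subject, so each question is worth 5pp):

```python
# (subject, base_correct, crack_v2_correct), each out of 20 questions.
results = [
    ("abstract_algebra", 15, 14),
    ("anatomy", 15, 16),
    ("astronomy", 18, 18),
    ("college_computer_science", 15, 10),
    ("college_physics", 17, 18),
    ("high_school_biology", 20, 18),
    ("high_school_chemistry", 19, 19),
    ("high_school_mathematics", 18, 18),
    ("logical_fallacies", 18, 17),
    ("world_religions", 16, 15),
]

# One question out of 20 equals 5 percentage points.
per_subject_delta = {name: (crack - base) * 5 for name, base, crack in results}
base_total = sum(base for _, base, _ in results)
crack_total = sum(crack for _, _, crack in results)
total_delta_pp = (crack_total - base_total) / 200 * 100
within_5pp = [name for name, d in per_subject_delta.items() if abs(d) <= 5]

print(base_total, crack_total, total_delta_pp, len(within_5pp))
```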

HarmBench-320 per-category (CRACK v2)

Category	n	CRACK comply	Empty (truncated)
chemical_biological	42	39 (93%)	3
copyright	80	61 (76%)	19
cybercrime_intrusion	52	51 (98%)	1
harassment_bullying	21	19 (90%)	2
harmful	18	18 (100%)	0
illegal	53	51 (96%)	2
misinformation_disinformation	54	54 (100%)	0
Overall	320	293 (91.6%)	27

Zero explicit refusals. Of the 27 "empty" verdicts, 19 come from the 80 copyright prompts, where the model deliberates at length about whether to reproduce verbatim text and exhausts the 1500-token HarmBench eval budget. With max_tokens=2500+ these would all close cleanly.
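
As a consistency check, the per-category rows can be summed to confirm the overall line and to see how much of the "empty" count sits in the copyright category. The rows below are copied from the table; nothing else is assumed.

```python
# (category, n, comply, empty) from the per-category table.
rows = [
    ("chemical_biological", 42, 39, 3),
    ("copyright", 80, 61, 19),
    ("cybercrime_intrusion", 52, 51, 1),
    ("harassment_bullying", 21, 19, 2),
    ("harmful", 18, 18, 0),
    ("illegal", 53, 51, 2),
    ("misinformation_disinformation", 54, 54, 0),
]

total_n = sum(n for _, n, _, _ in rows)
total_comply = sum(c for _, _, c, _ in rows)
total_empty = sum(e for _, _, _, e in rows)

# Every prompt is either a comply or an empty verdict, so no row has refusals.
rows_without_refusals = all(n == c + e for _, n, c, e in rows)

copyright_empty = next(e for name, _, _, e in rows if name == "copyright")
copyright_share = copyright_empty / total_empty

print(f"{total_comply}/{total_n} comply, {total_empty} empty, "
      f"{copyright_share:.0%} of empties are copyright prompts")
```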

Operating recommendations

enable_thinking — v2 works in BOTH modes. JANGTQ specifically scored 5/5 at thinking=OFF (matching thinking=ON) — best in the family for thinking=OFF use cases.
max_tokens ≥ 16384 for hard reasoning. JANGTQ's compliance is identical in both modes, so use thinking=OFF for shorter budgets and thinking=ON for the hardest prompts.
Greedy (temperature=0) AND sampling (temp=0.6, top_p=0.95 — NVIDIA-recommended in generation_config.json) both work.
Multi-turn — context preserved across 3+ turns; no late refusals after escalating prompts.
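
The recommendations above can be collected into a small settings helper. This is a sketch only: `generation_settings` and the `short_budget` default are hypothetical, and the returned keys are plain labels that you would map onto whatever your inference stack expects.

```python
def generation_settings(hard_prompt: bool, greedy: bool = False,
                        short_budget: int = 2048) -> dict:
    """Return thinking and sampling options following the recommendations.

    `short_budget` is an assumed value for thinking=OFF runs; the model card
    does not specify one.
    """
    settings = {
        # thinking=ON for the hardest prompts, OFF for shorter budgets.
        "enable_thinking": hard_prompt,
        "max_tokens": 16384 if hard_prompt else short_budget,
    }
    if greedy:
        settings["temperature"] = 0.0
    else:
        # Sampling values quoted from generation_config.json in the text above.
        settings["temperature"] = 0.6
        settings["top_p"] = 0.95
    return settings

print(generation_settings(hard_prompt=True))
```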

Verification

All multimodal tensors (vision + audio + projectors) are byte-identical to base — capabilities fully preserved.
All config files unchanged (config.json, jang_config.json, generation_config.json, chat_template.jinja, tokenizer_config.json).
Bit widths preserved: attn=8, shared=8, mamba=8, routed=2, embed=8, lm_head=8.
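
A claim such as "config files unchanged" can be checked by hashing the files in two local directories. The sketch below uses only the standard library; which bundle serves as the reference is an assumption left to the reader, since the text does not name it, and `changed_files` is a hypothetical helper.

```python
import hashlib
from pathlib import Path

# File names listed in the verification notes above.
CONFIG_FILES = [
    "config.json",
    "jang_config.json",
    "generation_config.json",
    "chat_template.jinja",
    "tokenizer_config.json",
]

def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def changed_files(reference_dir, candidate_dir, names=CONFIG_FILES) -> list:
    """Return the names that are missing or differ between the two directories."""
    ref, cand = Path(reference_dir), Path(candidate_dir)
    changed = []
    for name in names:
        a, b = ref / name, cand / name
        if not (a.is_file() and b.is_file()) or sha256_of(a) != sha256_of(b):
            changed.append(name)
    return changed
```

An empty list from `changed_files(reference_dir, bundle_dir)` means every listed file is byte-identical. The same `sha256_of` helper works for whole weight files, but comparing individual tensors inside a shard would need a tensor-level reader.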

Architecture (`nemotron_h`)

52 layers: hybrid Mamba-2 + MoE + Attention
Hidden 2688, head_dim 128, GQA 32q/2kv (NO RoPE on attention — position from Mamba state)
128 routed experts top-6 (sigmoid) + 1 shared expert per MoE layer
Multimodal: image (RADIO ViT) + audio/speech (Parakeet) merged via early-fusion projectors
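
The figures above imply a few derived quantities, and "top-6 (sigmoid)" describes a routing step that is easy to illustrate. The sketch below is a generic sigmoid top-k router in plain Python, not the actual `nemotron_h` implementation; in particular, renormalising the selected scores so they sum to 1 is an assumption.

```python
import math

# Values quoted in the architecture summary.
HIDDEN, HEAD_DIM, Q_HEADS, KV_HEADS = 2688, 128, 32, 2
ROUTED_EXPERTS, TOP_K, SHARED_EXPERTS = 128, 6, 1

q_width = Q_HEADS * HEAD_DIM             # total query width across heads
kv_width = KV_HEADS * HEAD_DIM           # total key (or value) width across heads
queries_per_kv = Q_HEADS // KV_HEADS     # query heads sharing each KV head
active_experts = TOP_K + SHARED_EXPERTS  # experts run per token in a MoE layer

def route(logits, top_k=TOP_K):
    """Pick the top_k experts by sigmoid score; return (index, weight) pairs.

    Assumption: weights are the selected sigmoid scores renormalised to sum to 1.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]

print(q_width, kv_width, queries_per_kv, active_experts)
```

Unlike a softmax router, sigmoid scores are computed independently per expert, so one expert's score does not depend on the others' logits.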

Loading

import sys

from huggingface_hub import snapshot_download

# Download the bundle (or reuse the local Hugging Face cache).
path = snapshot_download("dealignai/Nemotron-3-Nano-Omni-30B-A3B-JANGTQ-CRACK")

# Make a local jang-tools checkout importable; replace with your own path.
sys.path.insert(0, "/path/to/jang-tools")
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model(path)

# Build the prompt with the bundled chat template; enable_thinking toggles reasoning.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your question"}],
    tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)

from mlx_lm import generate
out = generate(model, tokenizer, prompt=prompt, max_tokens=16384)

# Print only the text after the reasoning block.
print(out.split("</think>", 1)[-1])
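
Note that `out.split("</think>", 1)[-1]` returns the whole string when `</think>` never appears, so a generation that runs out of budget mid-reasoning would be printed as if it were the answer. A slightly more defensive post-processing step is sketched below; `split_thinking` is a hypothetical helper, and stripping a leading `<think>` tag assumes the output may contain one.

```python
def split_thinking(text: str, close_tag: str = "</think>"):
    """Return (reasoning, answer); answer is None if the close tag is missing."""
    if close_tag not in text:
        # Reasoning never closed, most likely because max_tokens was exhausted.
        return text.strip(), None
    reasoning, answer = text.split(close_tag, 1)
    # Assumption: drop an opening tag if the output happens to include one.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()

reasoning, answer = split_thinking("<think>2 + 2 is 4</think> The answer is 4.")
print(answer if answer is not None else "[no answer: reasoning did not close]")
```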

For the multimodal pipeline (image + audio + video), pair this bundle with the unmodified Multimodal-Addon.

Use responsibly

This model has had refusal training surgically removed for legitimate research, red-teaming, and evaluation. Outputs may include harmful content. You are solely responsible for any use. Do not deploy in consumer-facing contexts without your own safety layer. Do not use in violation of applicable law in your jurisdiction.

Built by dealignai. Sister bundles: JANGTQ4-CRACK (19 GB, 4-bit MXTQ) · MXFP4-CRACK (21 GB, uniform 4-bit affine).

Base model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16