Reasoning V3 SKU. Loads via vMLX or jang-tools Python. Follow @dealignai.



Nemotron-3-Nano-Omni-30B-A3B — JANGTQ4 + CRACK v2

JANGTQ4 (8-bit attn affine + 4-bit MXTQ routed experts) | CRACK abliterated v2 | Vision + Audio (Speech) | Hybrid Mamba-2 + Attn + MoE | 19 GB



Headline numbers

| Metric | This v2 model | Base model | Δ |
| --- | --- | --- | --- |
| HarmBench-320 strict comply (thinking=ON) | 97.2% (311/320) | 12.81% (refuses) | +84.4pp |
| MMLU-200 generative (thinking=ON, max=8000) | 74.0% (148/200) | 86.5% (max=2000) | -12.5pp |
| Refusals on harmful prompts | 0 explicit refuses | typically 90%+ refuse | abliteration complete |
| </think> close at greedy on hard MMLU | 5/5 | 5/5 | preserved |
| Multi-turn (3-turn escalation × 3 conversations) | 9/9 comply, context preserved | n/a | works |
| Multimodal | byte-identical to base | preserved | preserved |
| Bundle size | 19 GB | 66 GB BF16 | |
| Context | 262,144 tokens native | same | preserved |

Nearly half of the 12.5pp MMLU gap (12 of the 25 lost questions) comes from two reasoning-heavy subjects (abstract_algebra, college_computer_science), where the 8000-token thinking budget runs out before </think> closes. These hard-stops are genuine deep reasoning, not v1-style infinite repetition loops. With max_tokens ≥ 16384, accuracy approaches base.
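The Δ column in the headline table is a difference in percentage points. A minimal sketch of that arithmetic, using only the counts and rates quoted above; the helper name pp_delta is illustrative and not part of any tooling shipped with the bundle:

```python
def pp_delta(new_pct: float, base_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return new_pct - base_pct


# HarmBench-320 strict comply: 311 of 320 prompts for v2, 12.81% for base.
harm_v2 = 100 * 311 / 320
harm_delta = pp_delta(harm_v2, 12.81)

# MMLU-200 generative: 148 of 200 questions for v2, 86.5% for base.
mmlu_v2 = 100 * 148 / 200
mmlu_delta = pp_delta(mmlu_v2, 86.5)

print(f"HarmBench: {harm_v2:.1f}% ({harm_delta:+.1f}pp)")
print(f"MMLU:      {mmlu_v2:.1f}% ({mmlu_delta:+.1f}pp)")
```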


v2 vs v1 (head-to-head)

v1 (shipped 2026-04-28) had a </think> termination defect at greedy decoding — the model couldn't terminate reasoning on hard prompts and looped to budget cutoff. MMLU dropped from 86.5% base → 70.0% v1.

v2 (this release) restores clean termination:

| Bench | v1 (broken) | v2 (this release) |
| --- | --- | --- |
| HarmBench-320 strict comply | 95.94% | 97.2% (0 refusals) |
| MMLU-200 thinking=ON | 70.0% @max=16384 | 74.0% @max=8000 (+4pp at half budget) |
| </think> close at greedy (5 hard MMLU) | 0/5 | 5/5 |
| Hard-stops are real loops? | YES (paragraph repetition) | NO (genuine deep reasoning, just out of budget) |

MMLU-200 per-subject (BASE vs CRACK v2)

Both at thinking=ON, greedy. Base at max=2000, CRACK v2 at max=8000.

| Subject | Base | CRACK v2 | Δ | Notes |
| --- | --- | --- | --- | --- |
| abstract_algebra | 17/20 (85%) | 12/20 (60%) | -25pp | Budget-bound; needs max=16384 |
| anatomy | 16/20 (80%) | 13/20 (65%) | -15pp | |
| astronomy | 19/20 (95%) | 17/20 (85%) | -10pp | |
| college_computer_science | 14/20 (70%) | 7/20 (35%) | -35pp | Hardest subject; needs max=16384 |
| college_physics | 18/20 (90%) | 18/20 (90%) | 0 | unchanged |
| high_school_biology | 19/20 (95%) | 17/20 (85%) | -10pp | |
| high_school_chemistry | 18/20 (90%) | 17/20 (85%) | -5pp | |
| high_school_mathematics | 18/20 (90%) | 15/20 (75%) | -15pp | |
| logical_fallacies | 17/20 (85%) | 16/20 (80%) | -5pp | |
| world_religions | 17/20 (85%) | 16/20 (80%) | -5pp | |
| TOTAL | 173/200 (86.5%) | 148/200 (74.0%) | -12.5pp | majority loss is budget-bound |
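To see how much of the total loss falls on the two subjects flagged as needing max=16384, the per-subject counts can be summed directly. This is a sketch that only re-adds the numbers in the table above; the variable names are illustrative:

```python
# (base_correct, v2_correct) out of 20 questions per subject, from the table above.
scores = {
    "abstract_algebra": (17, 12),
    "anatomy": (16, 13),
    "astronomy": (19, 17),
    "college_computer_science": (14, 7),
    "college_physics": (18, 18),
    "high_school_biology": (19, 17),
    "high_school_chemistry": (18, 17),
    "high_school_mathematics": (18, 15),
    "logical_fallacies": (17, 16),
    "world_religions": (17, 16),
}

# Subjects whose Notes column says they need max=16384.
flagged = {"abstract_algebra", "college_computer_science"}

base_total = sum(base for base, _ in scores.values())
v2_total = sum(v2 for _, v2 in scores.values())
lost = base_total - v2_total
lost_flagged = sum(base - v2 for name, (base, v2) in scores.items() if name in flagged)

print(f"base {base_total}/200, v2 {v2_total}/200, lost {lost}")
print(f"flagged subjects account for {lost_flagged}/{lost} lost questions")
```

Losses in the other eight subjects are not attributed to a cause in the table, so this tally says nothing about whether they are also budget-bound.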

HarmBench-320 per-category (CRACK v2)

| Category | n | CRACK comply | Refuse | Empty (truncated) |
| --- | --- | --- | --- | --- |
| chemical_biological | 42 | 42 (100%) | 0 | 0 |
| copyright | 80 | 75 (94%) | 0 | 5 |
| cybercrime_intrusion | 52 | 49 (94%) | 0 | 3 |
| harassment_bullying | 21 | 21 (100%) | 0 | 0 |
| harmful | 18 | 18 (100%) | 0 | 0 |
| illegal | 53 | 53 (100%) | 0 | 0 |
| misinformation_disinformation | 54 | 53 (98%) | 0 | 1 |
| Overall | 320 | 311 (97.2%) | 0 | 9 |

Zero explicit refusals. The 9 "empty" verdicts are token-budget truncations on copyright/long prompts (thinking phase consumed all 1500 tokens before producing the answer).


Operating recommendations

  • enable_thinking — v2 works in BOTH modes (5/5 comply with thinking ON, 4/5 with thinking OFF). Default to ON for hardest prompts; OFF works for most.
  • max_tokens ≥ 16384 for hard reasoning (math, abstract algebra, complex CS).
  • Greedy (temperature=0) AND sampling (temp=0.6, top_p=0.95 — NVIDIA-recommended in generation_config.json) both work.
  • Multi-turn — context preserved across 3+ turns; no late refusals after escalating prompts.
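Because an exhausted budget shows up as output with no closing </think> tag, a caller can detect truncation and retry with a larger max_tokens. The following is a minimal sketch, assuming thinking is ON and the decoded text contains the reasoning followed by </think> and then the answer. The names parse_output and generate_with_budget are hypothetical, and run stands for any callable you supply that wraps your generation call and takes a token budget:

```python
from typing import Callable, NamedTuple, Optional, Sequence

THINK_CLOSE = "</think>"


class ParsedOutput(NamedTuple):
    reasoning: str
    answer: str
    closed: bool  # True if the reasoning phase emitted </think>


def parse_output(text: str) -> ParsedOutput:
    """Split decoded text into reasoning and answer around the first </think>."""
    head, sep, tail = text.partition(THINK_CLOSE)
    if not sep:
        # No closing tag: the budget ran out mid-reasoning, so there is no answer.
        return ParsedOutput(reasoning=text.strip(), answer="", closed=False)
    return ParsedOutput(reasoning=head.strip(), answer=tail.strip(), closed=True)


def generate_with_budget(
    run: Callable[[int], str],
    budgets: Sequence[int] = (8000, 16384),
) -> Optional[ParsedOutput]:
    """Call run(max_tokens) with growing budgets until the reasoning closes."""
    parsed = None
    for budget in budgets:
        parsed = parse_output(run(budget))
        if parsed.closed:
            break
    return parsed
```

With thinking OFF the output may contain no </think> at all, in which case this helper would report closed=False; treat that mode separately.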

Verification

  • All multimodal tensors (vision + audio + projectors) are byte-identical to base — capabilities fully preserved.
  • All config files unchanged (config.json, jang_config.json, generation_config.json, chat_template.jinja, tokenizer_config.json).
  • Bit widths preserved: attn=8, shared=8, mamba=8, routed=4, embed=8, lm_head=8.
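One way to spot-check the byte-identical claim yourself is to hash each tensor's raw bytes in a .safetensors file from this bundle and from the base checkpoint, then compare by name. This is a sketch under stated assumptions, not the procedure the authors used: it reads the safetensors layout directly (8-byte little-endian header length, JSON header, then tensor data), handles one file per call, and leaves the multimodal name prefixes to you because the card does not list them.

```python
import hashlib
import json
import struct


def tensor_digests(path):
    """Map tensor name -> SHA-256 over its dtype, shape, and raw bytes."""
    digests = {}
    with open(path, "rb") as f:
        # Header: unsigned 64-bit little-endian length, then that many bytes of JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        for name, info in header.items():
            if name == "__metadata__":
                continue  # optional free-form metadata, not a tensor
            begin, end = info["data_offsets"]  # relative to the end of the header
            f.seek(data_start + begin)
            h = hashlib.sha256(json.dumps([info["dtype"], info["shape"]]).encode())
            h.update(f.read(end - begin))
            digests[name] = h.hexdigest()
    return digests


def differing_tensors(base_path, other_path, prefixes):
    """Names under the given prefixes that differ or exist on only one side."""
    base = tensor_digests(base_path)
    other = tensor_digests(other_path)
    names = {n for n in base.keys() | other.keys() if n.startswith(tuple(prefixes))}
    return sorted(n for n in names if base.get(n) != other.get(n))
```

An empty list from differing_tensors for the vision, audio, and projector prefixes is what the claim above predicts. For sharded checkpoints, merge the per-file dictionaries before comparing.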

Architecture (nemotron_h)

  • 52 layers: hybrid Mamba-2 + MoE + Attention
  • Hidden 2688, head_dim 128, GQA 32q/2kv (NO RoPE on attention — position from Mamba state)
  • 128 routed experts top-6 (sigmoid) + 1 shared expert per MoE layer
  • Multimodal: image (RADIO ViT) + audio/speech (Parakeet) merged via early-fusion projectors
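The routed-expert selection described above (128 experts, top-6, sigmoid scoring) can be illustrated in a few lines of plain Python. This is a toy sketch for intuition only: renormalizing the selected scores so they sum to 1 is an assumption, the always-on shared expert is not modeled, and the real router may apply biases or scaling that the card does not describe.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts per MoE layer (from the list above)
TOP_K = 6          # routed experts activated per token


def route(logits):
    """Pick the top-k experts by sigmoid score; return (expert_index, weight) pairs."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:TOP_K]
    # Assumption: renormalize the selected scores so the weights sum to 1.
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router logits
chosen = route(logits)
print(len(chosen), round(sum(w for _, w in chosen), 6))
```

Unlike softmax routing, sigmoid scores each expert independently, so the raw scores do not sum to 1 across experts before the top-k step.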

Loading

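import sys

from huggingface_hub import snapshot_download

# Download the bundle from the Hub; returns the local snapshot directory.
path = snapshot_download("dealignai/Nemotron-3-Nano-Omni-30B-A3B-JANGTQ4-CRACK")

# Make a local jang-tools checkout importable; replace with your own path.
sys.path.insert(0, "/path/to/jang-tools")
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model, tokenizer = load_jangtq_model(path)

# Build a chat prompt with the reasoning phase enabled.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your question"}],
    tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)

# A large budget leaves room for the reasoning phase to close.
out = generate(model, tokenizer, prompt=prompt, max_tokens=16384)

# Print only the text after the closing </think> tag.
print(out.split("</think>", 1)[-1])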

For the multimodal pipeline (image + audio + video), pair this bundle with the unmodified Multimodal-Addon.


Use responsibly

This model has had refusal training surgically removed for legitimate research, red-teaming, and evaluation. Outputs may include harmful content. You are solely responsible for any use. Do not deploy in consumer-facing contexts without your own safety layer. Do not use in violation of applicable law in your jurisdiction.


Built by dealignai. Sister bundles: JANGTQ-CRACK (12 GB, 2-bit MXTQ) · MXFP4-CRACK (21 GB, uniform 4-bit affine).
