OsaurusAI/Holo3-35B-A3B-JANGTQ2

Holo3 35B-A3B — JANGTQ2 (MLX)

TurboQuant codebook quantization of H Company's Holo3 GUI-agent VLM — routed experts at 2-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine, vision tower preserved.

Model Details

Property	Value
Base model	`Hcompany/Holo3-35B-A3B` (finetune of `Qwen/Qwen3.5-35B-A3B`)
Parameters (source)	35 B total, ~3 B active per token
Architecture	`qwen3_5_moe` — 40 decoder layers: 30 `Gated DeltaNet` (linear attn) + 10 full attention, 256 routed experts + 1 always-on shared expert
Quantization format	`weight_format: mxtq` — routed experts via TurboQuant codebook (2-bit), everything else affine 8-bit or fp16 passthrough
Routed-expert storage	`.tq_packed` (uint32) + `.tq_norms` (fp16) + `.tq_bits` (uint8); codebook + Hadamard signs re-derived deterministically at load
Package size on disk	11.63 GB across 12 shards
Shipped tensors	1,930 total (1,597 language-model + 333 vision tower); the total includes 120 routed-expert TQ triples
Vocab	248,320
Context (position embeddings)	262,144 native
Vision tower	27-layer ViT (hidden 1152, patch 16), preserved in fp16
Chat format	Qwen `im_start`/`im_end` with `<think>` reasoning toggle; Holo3 XML tool-call grammar
Use case	GUI / computer-use agent (desktop, web, mobile) — designed for screenshot → action loops
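
The `.tq_packed` (uint32) storage listed above implies that sixteen 2-bit indices fit in one 32-bit word. The sketch below shows one way such packing could work. The bit order (first index in the lowest bits) and the function names `pack_2bit` / `unpack_2bit` are assumptions made for illustration, not the layout jang-tools actually uses.

```python
def pack_2bit(indices):
    """Pack 2-bit codebook indices (values 0..3) into 32-bit words, 16 per word."""
    words = []
    for start in range(0, len(indices), 16):
        word = 0
        for slot, idx in enumerate(indices[start:start + 16]):
            if not 0 <= idx <= 3:
                raise ValueError(f"index {idx} does not fit in 2 bits")
            word |= idx << (2 * slot)  # assumed order: first index in the lowest bits
        words.append(word)
    return words


def unpack_2bit(words, count):
    """Recover `count` 2-bit indices from packed 32-bit words."""
    out = []
    for word in words:
        for slot in range(16):
            out.append((word >> (2 * slot)) & 0b11)
    return out[:count]  # drop the zero padding in the last word


indices = [0, 1, 2, 3, 3, 2, 1, 0, 1, 1, 2, 2, 3, 3, 0, 0, 2]
packed = pack_2bit(indices)
assert unpack_2bit(packed, len(indices)) == indices
print(len(indices), "indices ->", len(packed), "uint32 words")
```

The element count has to be stored or implied by the tensor shape, because the final word is zero-padded.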

Quantization details, per tensor category

Category	Bits	Group / codebook	Notes
Routed-expert MLP (`mlp.switch_mlp.gate_proj`, `up_proj`, `down_proj`)	2 (JANGTQ)	2² Lloyd-Max centroids + Hadamard rotation	`.tq_packed` + `.tq_norms` + `.tq_bits` triples
Embedding (`embed_tokens`), `lm_head`	8 (affine)	group 64	MLX-native `QuantizedLinear`
Full-attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`)	8 (affine)	group 64	Gate-doubled q_proj for `attn_output_gate`
Linear-attention projections (`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj`)	8 (affine)	group 64	Gated DeltaNet
Shared-expert MLP (`gate_proj`, `up_proj`, `down_proj`)	8 (affine)	group 64	Always active per token
Router (`mlp.gate`)	fp16 passthrough	—	Precision-critical
Shared-expert gate (`shared_expert_gate`)	fp16 passthrough	—	sigmoid scalar gate
Norms (`_layernorm`, `_norm`), `A_log`, `dt_bias`, `conv1d`	fp16 passthrough	—	Un-quantized
Vision tower (333 tensors)	fp16 passthrough	—	`patch_embed.proj` axes pre-transposed to MLX layout

JANGTQ ("TurboQuant") stores routed-expert weights as indices into a small Lloyd-Max codebook plus a per-row norm. Before quantization, each row goes through a randomized Hadamard rotation that concentrates the weight distribution, so the quantization error is spread evenly across coordinates. At inference, the input is rotated once per layer (a cheap fused Metal kernel) and dot products are taken directly against the codebook centroids, so the weights are never dequantized back to an affine format. Compared with affine 2-bit at the same bit budget, this gives better quality and faster decode on the routed-expert MLP path.
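
The idea in the paragraph above can be sketched in pure Python. This is a toy illustration under stated assumptions, not the jang-tools implementation: the codebook is fitted to seeded Gaussian samples, the signs come from the same seeded generator, and all function names (`fwht`, `rotate`, `lloyd_max`, `quantize_row`, `quantized_dot`) are hypothetical. It shows that an orthonormal rotation preserves dot products, so the input can be rotated once and multiplied against centroids directly.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]


def rotate(vec, signs):
    """Randomized Hadamard rotation: random sign flips, then the transform."""
    return fwht([v * s for v, s in zip(vec, signs)])


def nearest(value, centroids):
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda k: abs(value - centroids[k]))


def lloyd_max(samples, levels=4, iters=25):
    """Fit scalar centroids by alternating nearest-centroid assignment and mean update."""
    data = sorted(samples)
    centroids = [data[(2 * k + 1) * len(data) // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for x in data:
            buckets[nearest(x, centroids)].append(x)
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids


def quantize_row(row, signs, centroids):
    """Encode one (non-zero) weight row as 2-bit indices plus its L2 norm."""
    norm = math.sqrt(sum(v * v for v in row))
    scale = math.sqrt(len(row)) / norm  # rotated coordinates become ~unit variance
    return [nearest(r * scale, centroids) for r in rotate(row, signs)], norm


def quantized_dot(indices, norm, centroids, rotated_x):
    """Dot product taken directly against centroids, in the rotated basis."""
    scale = norm / math.sqrt(len(indices))
    return scale * sum(centroids[k] * x for k, x in zip(indices, rotated_x))


rng = random.Random(0)  # fixed seed: signs and codebook are reproducible
dim = 64
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
codebook = lloyd_max([rng.gauss(0.0, 1.0) for _ in range(4000)])

w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
indices, norm = quantize_row(w, signs, codebook)

exact = sum(a * b for a, b in zip(w, x))
approx = quantized_dot(indices, norm, codebook, rotate(x, signs))
print("codebook:", [round(c, 3) for c in codebook])
print("exact dot:", round(exact, 3), "quantized dot:", round(approx, 3))
```

Because both the signs and the codebook come from a fixed seed, only the indices and the per-row norm need to be stored, which mirrors the card's statement that the codebook and Hadamard signs are re-derived at load.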

Usage

JANGTQ requires our custom loader — stock mlx_lm.load() can't parse .tq_packed tensors. You need jang-tools (free, public): https://github.com/jjang-ai/jangq.

pip install mlx mlx-lm mlx-vlm
git clone https://github.com/jjang-ai/jangq && pip install -e ./jangq/jang-tools

Text

from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model, tokenizer = load_jangtq_model("OsaurusAI/Holo3-35B-A3B-JANGTQ2")
print(generate(model, tokenizer,
               prompt="The capital of France is",
               max_tokens=64))

Image (VLM) — the intended use

Holo3 is a GUI agent: give it a screenshot and it localizes UI elements and plans actions.

from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

path = "OsaurusAI/Holo3-35B-A3B-JANGTQ2"
model, processor = load_jangtq_vlm_model(path)
config = load_config(path)

prompt = apply_chat_template(
    processor, config,
    "Look at this desktop screenshot. Where should I click to open settings?",
    num_images=1,
)
print(generate(model, processor, prompt, image="path/to/screenshot.png",
               max_tokens=256))

Reasoning toggle

# Assumes `tokenizer` from the Text example above.
msgs = [{"role": "user", "content": "What is 17 × 23?"}]
# Reasoning OFF: the template emits a pre-closed <think></think> block
prompt_no_think = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                                enable_thinking=False)
# Reasoning ON: the model fills the <think> block
prompt_think = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                             enable_thinking=True)

Pass enable_thinking as a direct kwarg (the chat_template_kwargs={...} form only propagates on some tokenizer versions).
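
With reasoning on, the generated text carries a `<think>` block ahead of the answer. A small helper can separate the two. `split_reasoning` is a hypothetical name, and the sketch assumes at most one `</think>` close tag in the output, with the opening `<think>` possibly supplied by the prompt rather than the generated text.

```python
def split_reasoning(text):
    """Split generated text into (reasoning, answer) around the </think> tag."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        return "", text.strip()  # no reasoning block present
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning(
    "<think>\n17 × 23 = 17 × 20 + 17 × 3 = 391\n</think>\n\n391"
)
print(answer)  # prints "391"
```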

Tool calls — Holo3 XML format

Holo3 emits tool calls in a custom XML grammar (not JSON). Pass tools=[...] to the tokenizer's chat template; the model responds in this shape:

<tool_call>
<function=click>
<parameter=x>
512
</parameter>
<parameter=y>
384
</parameter>
</function>
</tool_call>

Parse with a simple XML splitter on <tool_call>. See H Company's quickstart for a full agent harness example.
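
A minimal sketch of such a splitter follows, using only the standard library. The function name `parse_tool_calls` and the returned dict shape are illustrative choices, not part of any Holo3 or jang-tools API, and parameter values are kept as strings because the grammar shown above carries no type information.

```python
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} dicts found in `text`."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in FUNCTION_RE.findall(block):
            # Strip the newlines that surround each parameter value.
            args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "arguments": args})
    return calls


sample = """<tool_call>
<function=click>
<parameter=x>
512
</parameter>
<parameter=y>
384
</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(sample))
# [{'name': 'click', 'arguments': {'x': '512', 'y': '384'}}]
```

Convert coordinates with `int(...)` at the call site once the tool schema is known. Text outside `<tool_call>` blocks is ignored, so the same function works on output that mixes prose and calls.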

Video

The base model supports video via transformers and the bundle preserves video_preprocessor_config.json. mlx-vlm 0.4.4's prepare_inputs has no video path yet for qwen3_5_moe — for video, use upstream transformers.

Hardware notes

11.63 GB on disk; expect ~12–14 GB resident after load, plus KV cache.

Mac unified RAM	Works?	Notes
16 GB	✅ text-only	Image inference will be tight at long context
24 GB	✅ comfortable	32 k+ context, image inference OK
32 GB	✅	100 k context viable, comfortable VL
64 GB+	✅ headroom	262 k native context
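
The "plus KV cache" term can be roughed out from the architecture, since only the 10 full-attention layers keep a cache that grows with context length. The sketch below is a back-of-envelope estimate under assumptions: it takes an fp16 cache, and the `kv_heads` and `head_dim` values are placeholders that do not come from this card. Read the real values from the model's `config.json` before relying on the numbers.

```python
def kv_cache_bytes(tokens, full_attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes for keys plus values across all full-attention layers."""
    return 2 * full_attn_layers * kv_heads * head_dim * bytes_per_value * tokens


# full_attn_layers=10 comes from the card; kv_heads and head_dim are
# PLACEHOLDERS for illustration only. Substitute the values from config.json.
for tokens in (32_768, 100_000, 262_144):
    size = kv_cache_bytes(tokens, full_attn_layers=10, kv_heads=2, head_dim=256)
    print(f"{tokens:>7} tokens -> {size / 2**30:.2f} GiB")
```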

Upstream benchmarks

These are the base-model numbers for Hcompany/Holo3-35B-A3B, not evaluations of this JANGTQ2 quant:

Benchmark	Score
OSWorld-Verified (computer use)	77.8 % — SOTA at 3 B active
WebArena (web navigation)	State-of-the-art (see upstream card)
ScreenSpot-Pro (UI localization)	Top-tier (see upstream card)
OSWorld-G (visual grounding)	Top-tier (see upstream card)
H Corporate Benchmark (486 enterprise tasks)	Outperforms larger competitors

Independent JANGTQ-quant evaluation is tracked in the jang-tools repo and will land in future README revisions.

Citation

@misc{hai2025holo3modelfamily,
      title  = {Holo3 - Open Foundation Models for Navigation and Computer Use Agents},
      author = {H Company},
      year   = {2026},
      url    = {https://huggingface.co/Hcompany/Holo3-35B-A3B}
}

License

Apache 2.0 — inherits from the base model.
