OsaurusAI/Holo3-35B-A3B-JANGTQ4

Holo3 35B-A3B — JANGTQ4 (MLX)

TurboQuant codebook quantization of H Company's Holo3 GUI-agent VLM — routed experts at 4-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine, vision tower preserved. The highest-quality JANGTQ profile for this family.

Model Details

Property	Value
Base model	`Hcompany/Holo3-35B-A3B` (finetune of `Qwen/Qwen3.5-35B-A3B`)
Parameters (source)	35 B total, ~3 B active per token
Architecture	`qwen3_5_moe` — 40 decoder layers: 30 `Gated DeltaNet` (linear attn) + 10 full attention, 256 routed experts + 1 always-on shared expert
Quantization format	`weight_format: mxtq` — routed experts via TurboQuant codebook (4-bit), everything else affine 8-bit or fp16 passthrough
Routed-expert storage	`.tq_packed` (uint32) + `.tq_norms` (fp16) + `.tq_bits` (uint8); codebook + Hadamard signs re-derived deterministically at load
Package size on disk	19.68 GB across 19 shards
Shipped tensors	1,930 total (1,597 language-model + 333 vision tower); the language-model count includes the 120 routed-expert TQ triples
Vocab	248,320
Context (position embeddings)	262,144 native
Vision tower	27-layer ViT (hidden 1152, patch 16), preserved in fp16
Chat format	Qwen `im_start`/`im_end` with `<think>` reasoning toggle; Holo3 XML tool-call grammar
Use case	GUI / computer-use agent (desktop, web, mobile) — designed for screenshot → action loops
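The routed-expert storage row above says 4-bit codebook indices live in `uint32` words (`.tq_packed`). The sketch below shows the arithmetic only: eight 4-bit indices fit in one 32-bit word. The nibble order (first index in the lowest bits) and the helper names `pack_4bit` / `unpack_4bit` are assumptions for illustration, not the documented on-disk layout.

```python
def pack_4bit(indices):
    """Pack 4-bit codebook indices (0..15) into 32-bit words, 8 per word."""
    if any(not 0 <= i < 16 for i in indices):
        raise ValueError("every index must fit in 4 bits")
    # Pad with zeros so the length is a multiple of 8.
    padded = list(indices) + [0] * (-len(indices) % 8)
    words = []
    for start in range(0, len(padded), 8):
        word = 0
        for slot, idx in enumerate(padded[start:start + 8]):
            # Assumed order: first index goes into the lowest nibble.
            word |= idx << (4 * slot)
        words.append(word)
    return words


def unpack_4bit(words, count):
    """Inverse of pack_4bit; `count` drops the zero padding."""
    out = []
    for word in words:
        for slot in range(8):
            out.append((word >> (4 * slot)) & 0xF)
    return out[:count]


codes = [1, 2, 3, 15, 0, 7, 9, 4, 12]
packed = pack_4bit(codes)
print([hex(w) for w in packed], unpack_4bit(packed, len(codes)) == codes)
```

Under this layout a row of N weights needs N / 8 `uint32` words plus one fp16 norm, which is where the roughly 4 bits per routed-expert weight comes from.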

Quantization details, per tensor category

Category	Bits	Group / codebook	Notes
Routed-expert MLP (`mlp.switch_mlp.gate_proj`, `up_proj`, `down_proj`)	4 (JANGTQ)	2⁴ Lloyd-Max centroids + Hadamard rotation	`.tq_packed` + `.tq_norms` + `.tq_bits` triples
Embedding (`embed_tokens`), `lm_head`	8 (affine)	group 64	MLX-native `QuantizedLinear`
Full-attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`)	8 (affine)	group 64	Gate-doubled q_proj for `attn_output_gate`
Linear-attention projections (`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj`)	8 (affine)	group 64	Gated DeltaNet
Shared-expert MLP (`gate_proj`, `up_proj`, `down_proj`)	8 (affine)	group 64	Always active per token
Router (`mlp.gate`)	fp16 passthrough	—	Precision-critical
Shared-expert gate (`shared_expert_gate`)	fp16 passthrough	—	sigmoid scalar gate
Norms (`_layernorm`, `_norm`), `A_log`, `dt_bias`, `conv1d`	fp16 passthrough	—	Un-quantized
Vision tower (333 tensors)	fp16 passthrough	—	`patch_embed.proj` axes pre-transposed to MLX layout

JANGTQ ("TurboQuant") stores routed-expert weights as indices into a small Lloyd-Max codebook plus a per-row norm. Before quantization, a randomized Hadamard rotation spreads outlier weights across coordinates, so one shared codebook fits every row and the quantization error is distributed evenly. At 4-bit, the 16-centroid codebook captures the routed-expert weight distribution tightly enough that this profile is the highest-quality JANGTQ option for qwen3_5_moe, and it is preferred over JANGTQ2 when RAM isn't the bottleneck.
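To make the rotate, normalise, and look-up steps concrete, here is a minimal pure-Python sketch of the general technique. It is not the jang-tools implementation: the function names are hypothetical, the codebook is fitted to toy Gaussian data rather than derived the way the loader derives it, and a seeded RNG stands in for the deterministic sign derivation.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(vec)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    out, h = list(vec), 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def nearest(codebook, x):
    """Index of the centroid closest to x."""
    return min(range(len(codebook)), key=lambda j: abs(x - codebook[j]))


def lloyd_max(samples, bits=4, iters=20):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and centroid means."""
    k, ordered = 2 ** bits, sorted(samples)
    # Start from evenly spaced quantiles of the data.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for x in samples:
            buckets[nearest(centroids, x)].append(x)
        centroids = [sum(b) / len(b) if b else centroids[j]
                     for j, b in enumerate(buckets)]
    return sorted(centroids)


def rotate_and_normalise(row, signs):
    """Apply random signs + Hadamard, then divide by the row norm."""
    rotated = fwht([w * s for w, s in zip(row, signs)])
    norm = math.sqrt(sum(x * x for x in rotated)) or 1.0
    return [x / norm for x in rotated], norm


def quantize_row(row, signs, codebook):
    unit, norm = rotate_and_normalise(row, signs)
    return [nearest(codebook, u) for u in unit], norm


def dequantize_row(indices, norm, signs, codebook):
    rotated = [codebook[i] * norm for i in indices]
    # The orthonormal Hadamard matrix is its own inverse; then undo the signs.
    return [x * s for x, s in zip(fwht(rotated), signs)]


rng = random.Random(0)  # fixed seed: the signs can be re-derived, not stored
dim = 64
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(32)]
train = [u for row in rows for u in rotate_and_normalise(row, signs)[0]]
codebook = lloyd_max(train, bits=4)

indices, norm = quantize_row(rows[0], signs, codebook)
restored = dequantize_row(indices, norm, signs, codebook)
rel_err = (math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[0], restored)))
           / math.sqrt(sum(a * a for a in rows[0])))
print(f"{len(codebook)} centroids, relative error {rel_err:.3f}")
```

Only the indices and the per-row norm need to be stored. The signs and codebook can be regenerated from fixed inputs, which matches the "re-derived deterministically at load" note in the Model Details table.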

Usage

JANGTQ requires our custom loader — stock mlx_lm.load() can't parse .tq_packed tensors. You need jang-tools (free, public): https://github.com/jjang-ai/jangq.

pip install mlx mlx-lm mlx-vlm
git clone https://github.com/jjang-ai/jangq && pip install -e ./jangq/jang-tools

Text

from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model, tokenizer = load_jangtq_model("OsaurusAI/Holo3-35B-A3B-JANGTQ4")
print(generate(model, tokenizer,
               prompt="The capital of France is",
               max_tokens=64))

Image (VLM) — the intended use

Holo3 is a GUI agent: give it a screenshot and it localizes UI elements and plans actions.

from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

path = "OsaurusAI/Holo3-35B-A3B-JANGTQ4"
model, processor = load_jangtq_vlm_model(path)
config = load_config(path)

prompt = apply_chat_template(
    processor, config,
    "Look at this desktop screenshot. Where should I click to open settings?",
    num_images=1,
)
print(generate(model, processor, prompt, image="path/to/screenshot.png",
               max_tokens=256))

Reasoning toggle

# `tokenizer` is the one returned by load_jangtq_model() in the Text example.
msgs = [{"role": "user", "content": "What is 17 × 23?"}]
# Reasoning OFF: the template emits a pre-closed <think></think> block
prompt_no_think = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                                enable_thinking=False)
# Reasoning ON: the model fills the <think> block
prompt_think = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                             enable_thinking=True)

Pass enable_thinking as a direct kwarg (the chat_template_kwargs={...} form only propagates on some tokenizer versions).
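With reasoning on, downstream code usually wants the final answer separated from the `<think>` block. The helper below is a small sketch with a hypothetical name (`split_reasoning`). It assumes the completion contains at most one `</think>` marker and tolerates a missing opening tag, since some chat templates place `<think>` in the prompt rather than in the generated text.

```python
def split_reasoning(text):
    """Return (reasoning, answer) from a completion that may contain a <think> block."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag: treat the whole completion as the answer.
        return "", text.strip()
    # Drop the opening tag if the model (rather than the template) emitted it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>\n17 * 23 = 17 * 20 + 17 * 3 = 391\n</think>\n\n391"
reasoning, answer = split_reasoning(sample)
print(repr(reasoning), repr(answer))
```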

Tool calls — Holo3 XML format

Holo3 emits tool calls in a custom XML grammar (not JSON). Pass tools=[...] to the tokenizer's chat template; the model responds in this shape:

<tool_call>
<function=click>
<parameter=x>
512
</parameter>
<parameter=y>
384
</parameter>
</function>
</tool_call>

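The grammar is XML-like but not well-formed XML (tags such as <function=click> carry the name after the = sign), so parse it with a simple splitter on <tool_call> rather than a standard XML parser. See H Company's quickstart for a full agent harness example.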
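A minimal parsing sketch for the shape shown above, using only the standard library. The function name `parse_tool_calls` and the returned dictionary layout are illustrative choices, not part of Holo3 or jang-tools. Parameter values are kept as strings because the grammar above carries no type information.

```python
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Extract every tool call in `text` as {"name": ..., "arguments": {...}}."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in FUNCTION_RE.findall(block):
            # Values are stripped of surrounding whitespace and left as strings.
            args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "arguments": args})
    return calls


completion = """I will click the gear icon.
<tool_call>
<function=click>
<parameter=x>
512
</parameter>
<parameter=y>
384
</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(completion))
```

Text outside the `<tool_call>` tags is ignored, so the same function works on completions that mix prose with actions. Convert values such as `x` and `y` with `int(...)` in the harness that knows each tool's schema.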

Video

The base model supports video via transformers and the bundle preserves video_preprocessor_config.json. mlx-vlm 0.4.4's prepare_inputs has no video path yet for qwen3_5_moe — for video, use upstream transformers.

Hardware notes

19.68 GB on disk; expect ~20–22 GB resident after load, plus KV cache.

Mac unified RAM	Works?	Notes
24 GB	✅ text-only	Image inference will be tight at long context
32 GB	✅ comfortable	32 k+ context, comfortable VL
64 GB+	✅ headroom	262 k native context
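For a rough sense of the "plus KV cache" term, the sketch below estimates fp16 cache growth with context length. Only the 10 full-attention layers keep a cache that grows per token; the Gated DeltaNet layers carry a fixed-size state. The `kv_heads` and `head_dim` defaults are placeholders, not values from this card, so read the real ones from the model's `config.json`. The estimate ignores vision-token activations and any cache quantization.

```python
def kv_cache_gib(context_tokens, full_attn_layers=10, kv_heads=2,
                 head_dim=256, bytes_per_value=2):
    """Approximate KV-cache size in GiB for the full-attention layers only.

    kv_heads and head_dim are placeholder assumptions; replace them with
    the values from config.json before relying on the result.
    """
    # Factor 2 covers keys and values.
    total_bytes = (2 * full_attn_layers * kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2 ** 30


for tokens in (8_192, 32_768, 262_144):
    print(f"{tokens:>7} tokens: {kv_cache_gib(tokens):.3f} GiB")
```

Add the result to the ~20–22 GB resident figure to see which row of the table a given context length falls into.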

Upstream benchmarks

These are the base-model numbers for Hcompany/Holo3-35B-A3B, not evaluations of this JANGTQ4 quant:

Benchmark	Score
OSWorld-Verified (computer use)	77.8 % — SOTA at 3 B active
WebArena (web navigation)	State-of-the-art (see upstream card)
ScreenSpot-Pro (UI localization)	Top-tier (see upstream card)
OSWorld-G (visual grounding)	Top-tier (see upstream card)
H Corporate Benchmark (486 enterprise tasks)	Outperforms larger competitors

Independent JANGTQ-quant evaluation is tracked in the jang-tools repo and will land in future README revisions.

Citation

@misc{hai2025holo3modelfamily,
      title  = {Holo3 - Open Foundation Models for Navigation and Computer Use Agents},
      author = {H Company},
      year   = {2026},
      url    = {https://huggingface.co/Hcompany/Holo3-35B-A3B}
}

License

Apache 2.0 — inherits from the base model.
