# Holo3 35B-A3B — JANGTQ4 (MLX)
TurboQuant codebook quantization of H Company's Holo3 GUI-agent VLM — routed experts at 4-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine, vision tower preserved. The highest-quality JANGTQ profile for this family.
## Model Details
| Property | Value |
|---|---|
| Base model | Hcompany/Holo3-35B-A3B (finetune of Qwen/Qwen3.5-35B-A3B) |
| Parameters (source) | 35 B total, ~3 B active per token |
| Architecture | qwen3_5_moe — 40 decoder layers: 30 Gated DeltaNet (linear attn) + 10 full attention, 256 routed experts + 1 always-on shared expert |
| Quantization format | weight_format: mxtq — routed experts via TurboQuant codebook (4-bit), everything else affine 8-bit or fp16 passthrough |
| Routed-expert storage | .tq_packed (uint32) + .tq_norms (fp16) + .tq_bits (uint8); codebook + Hadamard signs re-derived deterministically at load |
| Package size on disk | 19.68 GB across 19 shards |
| Shipped tensors | 1,930 total: 1,597 language-model (routed experts stored as 120 TQ triples) + 333 vision tower |
| Vocab | 248,320 |
| Context (position embeddings) | 262,144 native |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), preserved in fp16 |
| Chat format | Qwen `im_start`/`im_end` with `<think>` reasoning toggle; Holo3 XML tool-call grammar |
| Use case | GUI / computer-use agent (desktop, web, mobile) — designed for screenshot → action loops |
## Quantization details per tensor category
| Category | Bits | Group / codebook | Notes |
|---|---|---|---|
| Routed-expert MLP (mlp.switch_mlp.gate_proj, up_proj, down_proj) | 4 (JANGTQ) | 2⁴ Lloyd-Max centroids + Hadamard rotation | .tq_packed + .tq_norms + .tq_bits triples |
| Embedding (embed_tokens), lm_head | 8 (affine) | group 64 | MLX-native QuantizedLinear |
| Full-attention projections (q_proj, k_proj, v_proj, o_proj) | 8 (affine) | group 64 | Gate-doubled q_proj for attn_output_gate |
| Linear-attention projections (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj) | 8 (affine) | group 64 | Gated DeltaNet |
| Shared-expert MLP (gate_proj, up_proj, down_proj) | 8 (affine) | group 64 | Always active per token |
| Router (mlp.gate) | fp16 passthrough | — | Precision-critical |
| Shared-expert gate (shared_expert_gate) | fp16 passthrough | — | Sigmoid scalar gate |
| Norms (*_layernorm, *_norm), A_log, dt_bias, conv1d | fp16 passthrough | — | Un-quantized |
| Vision tower (333 tensors) | fp16 passthrough | — | patch_embed.proj axes pre-transposed to MLX layout |
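For the 8-bit affine rows above, a minimal sketch of what group-64 affine quantization means, using MLX's stock `mx.quantize` / `mx.dequantize` pair (illustrative only — jang-tools handles the actual packing at load time):

```python
import mlx.core as mx

# Group-64 affine quantization: every contiguous group of 64 weights shares
# one scale and one bias, so w ≈ w_q * scale + bias within the group.
w = mx.random.normal((512, 512))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=8)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=8)
print(mx.abs(w - w_hat).max())  # small: 8-bit affine is near-lossless here
```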
JANGTQ ("TurboQuant") stores routed-expert weights as indices into a small Lloyd-Max codebook with a per-row norm, after a randomized Hadamard rotation that concentrates the distribution so quantization error is uniform. At 4-bit, the 16-centroid codebook captures the routed-expert weight distribution tightly enough that this profile is the highest-quality JANGTQ option for qwen3_5_moe — preferred over JANGTQ2 when RAM isn't the bottleneck.
## Usage

JANGTQ requires our custom loader — stock `mlx_lm.load()` can't parse `.tq_packed` tensors. You need jang-tools (free, public): https://github.com/jjang-ai/jangq.

```bash
pip install mlx mlx-lm mlx-vlm
git clone https://github.com/jjang-ai/jangq && pip install -e ./jangq/jang-tools
```
### Text

```python
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model, tokenizer = load_jangtq_model("OsaurusAI/Holo3-35B-A3B-JANGTQ4")
print(generate(model, tokenizer,
               prompt="The capital of France is",
               max_tokens=64))
```
### Image (VLM) — the intended use

Holo3 is a GUI agent: give it a screenshot and it localizes UI elements and plans actions.

```python
from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

path = "OsaurusAI/Holo3-35B-A3B-JANGTQ4"
model, processor = load_jangtq_vlm_model(path)
config = load_config(path)

prompt = apply_chat_template(
    processor, config,
    "Look at this desktop screenshot. Where should I click to open settings?",
    num_images=1,
)
print(generate(model, processor, prompt, image="path/to/screenshot.png",
               max_tokens=256))
```
### Reasoning toggle

```python
msgs = [{"role": "user", "content": "What is 17 × 23?"}]

# Reasoning OFF — pre-closed <think></think> block
prompt = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                       enable_thinking=False)

# Reasoning ON — model fills the <think> block
prompt = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                       enable_thinking=True)
```

Pass `enable_thinking` as a direct kwarg (the `chat_template_kwargs={...}` form only propagates on some tokenizer versions).
### Tool calls — Holo3 XML format

Holo3 emits tool calls in a custom XML grammar (not JSON). Pass `tools=[...]` to the tokenizer's chat template; the model responds in this shape:

```xml
<tool_call>
<function=click>
<parameter=x>
512
</parameter>
<parameter=y>
384
</parameter>
</function>
</tool_call>
```

Parse with a simple XML splitter on `<tool_call>`. See H Company's quickstart for a full agent harness example.
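For example, a minimal regex-based splitter along those lines (tag shapes taken from the example above; a production harness should prefer H Company's quickstart parser):

```python
import re

# Matches the <tool_call><function=NAME>...</function></tool_call> grammar
# shown above and pulls out each <parameter=NAME>VALUE</parameter> pair.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=(\w+)>(.*?)</function>\s*</tool_call>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=(\w+)>\s*(.*?)\s*</parameter>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        params = dict(PARAM_RE.findall(body))
        calls.append({"name": name, "parameters": params})
    return calls

# The example output above parses to:
# [{'name': 'click', 'parameters': {'x': '512', 'y': '384'}}]
```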
### Video

The base model supports video via transformers, and the bundle preserves `video_preprocessor_config.json`. mlx-vlm 0.4.4's `prepare_inputs` has no video path yet for qwen3_5_moe — for video, use upstream transformers.
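For reference, a hedged sketch of that transformers path on the *base* model (not this quant). The `AutoModelForImageTextToText` class and the video message schema follow the Qwen-VL convention in recent transformers and are assumptions here; verify against the upstream Holo3 card:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Runs the unquantized base model through transformers — assumed class names
# and message schema; check Hcompany/Holo3-35B-A3B for the exact recipe.
model_id = "Hcompany/Holo3-35B-A3B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/screen_recording.mp4"},
        {"type": "text", "text": "Summarize the actions taken in this recording."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```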
## Hardware notes
19.68 GB on disk; expect ~20–22 GB resident after load, plus KV cache.
| Mac unified RAM | Works? | Notes |
|---|---|---|
| 24 GB | ✅ text-only | Image inference will be tight at long context |
| 32 GB | ✅ comfortable | 32 k+ context, comfortable VL |
| 64 GB+ | ✅ headroom | 262 k native context |
## Upstream benchmarks
These are the base-model numbers for Hcompany/Holo3-35B-A3B, not evaluations of this JANGTQ4 quant:
| Benchmark | Score |
|---|---|
| OSWorld-Verified (computer use) | 77.8 % — SOTA at 3 B active |
| WebArena (web navigation) | State-of-the-art (see upstream card) |
| ScreenSpot-Pro (UI localization) | Top-tier (see upstream card) |
| OSWorld-G (visual grounding) | Top-tier (see upstream card) |
| H Corporate Benchmark (486 enterprise tasks) | Outperforms larger competitors |
Independent JANGTQ-quant evaluation is tracked in the jang-tools repo and will land in future README revisions.
## Citation

```bibtex
@misc{hai2025holo3modelfamily,
  title  = {Holo3 - Open Foundation Models for Navigation and Computer Use Agents},
  author = {H Company},
  year   = {2026},
  url    = {https://huggingface.co/Hcompany/Holo3-35B-A3B}
}
```
## License
Apache 2.0 — inherits from the base model.
Packaged on Apple Silicon with jang-tools (mlx-lm 0.31.2) by Jinho Jang (eric@jangq.ai).
© 2026 Osaurus AI — osaurus.ai