Bonsai

Prism ML Website  |  White Paper  |  Demo & Examples  |  Discord

Ternary-Bonsai-1.7B-mlx-2bit

Ternary (1.58-bit) language model for Apple Silicon

7.2x smaller than FP16 | 3.8x faster on M4 Pro | 103 tok/s on iPhone | runs on Mac, iPhone, iPad

Highlights

  • 0.45 GiB (0.48 GB) packed 2-bit size (down from 3.44 GB FP16) — fits anywhere
  • Ternary weights {-1, 0, +1} across embeddings, attention projections, MLP projections, and LM head
  • 58.47 avg benchmark score across 6 categories
  • 103 tok/s on iPhone 17 Pro Max
  • MLX-native format with group size 128 and FP16 scaling

Pareto Frontier

Resources

  • White Paper
  • Demo repo — examples for serving, benchmarking, and integrating Bonsai
  • Discord — community support and updates
  • Kernels: MLX (Apple Silicon) · mlx-swift (iOS/macOS) — 2-bit format is supported out of the box

Model Overview

Item | Specification
Base model | Qwen3-1.7B
Parameters | 1.72B
Architecture | GQA, SwiGLU MLP, RoPE, RMSNorm
Context length | 32,768 tokens
Vocab size | 151,936
Weight format | Ternary g128: {-1, 0, +1} with FP16 group-wise scaling
Packed 2-bit size | 0.45 GiB (0.48 GB)
Ternary coverage | Embeddings, attention projections, MLP projections, LM head
License | Apache 2.0

Quantization Format: Ternary g128

Each weight takes a value from {-1, 0, +1}, with one shared FP16 scale per group of 128 weights:

w_i = scale_g * t_i,    t_i in {-1, 0, +1}

The information-theoretic cost is log2(3) ≈ 1.585 bits per weight, plus FP16 group scales (16 bits per 128 weights), for a theoretical minimum of ~1.71 bits/weight. This release uses the MLX 2-bit format, which stores each ternary value in 2 bits plus group scales, for an effective ~2.125 bits/weight.
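The arithmetic above can be sketched in a few lines. The byte layout below is illustrative only, not the actual MLX on-disk format:

```python
import math

GROUP = 128  # group size: one FP16 scale per 128 weights

# Bits per weight, including the amortized FP16 scale (16 bits / 128 weights)
theoretical = math.log2(3) + 16 / GROUP   # ~1.585 + 0.125 = ~1.71
mlx_2bit = 2.0 + 16 / GROUP               # 2 + 0.125 = 2.125

# Illustrative 2-bit packing: map {-1, 0, +1} -> {0, 1, 2}, four values per byte.
def pack_ternary(ternary):
    codes = [t + 1 for t in ternary]
    out = bytearray()
    for i in range(0, len(codes), 4):
        chunk = codes[i:i + 4] + [0] * (4 - len(codes[i:i + 4]))
        out.append(chunk[0] | chunk[1] << 2 | chunk[2] << 4 | chunk[3] << 6)
    return bytes(out)

def unpack_ternary(packed, n):
    codes = []
    for b in packed:
        codes += [(b >> s) & 0b11 for s in (0, 2, 4, 6)]
    return [c - 1 for c in codes[:n]]

t = [-1, 0, 1, 1, -1, 0]
assert unpack_ternary(pack_ternary(t), len(t)) == t

# Dequantization: w_i = scale_g * t_i (0.042 is a made-up example scale)
scale = 0.042
w = [scale * ti for ti in t]
```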

Memory

Format | Size | Reduction | Ratio
FP16 | 3.44 GB | -- | 1.0x
MLX 2-bit g128 | 0.45 GiB (0.48 GB) | 86.0% | 7.2x
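A back-of-the-envelope check on the table: 1.72B parameters at an effective 2.125 bits/weight gives roughly 0.46 GB. The gap to the reported 0.48 GB is assumed (not stated in the card) to be FP16 norm weights and file metadata:

```python
params = 1.72e9
bits_per_weight = 2 + 16 / 128   # 2-bit codes + FP16 scale per 128 weights

est_bytes = params * bits_per_weight / 8
est_gb = est_bytes / 1e9         # decimal GB, ~0.46
est_gib = est_bytes / 2**30      # binary GiB, ~0.43

fp16_gb = params * 16 / 8 / 1e9  # ~3.44 GB at 16 bits/weight
ratio = fp16_gb / 0.48           # ~7.2x, matching the table
```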

Quickstart

MLX (Python)

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("prism-ml/Ternary-Bonsai-1.7B-mlx-2bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms.",
    max_tokens=256,
)
print(response)

Throughput (MLX / Apple Silicon)

Platform | Backend | PP512 (tok/s) | TG128 (tok/s) | FP16 TG (tok/s) | Speedup
M4 Pro 48 GB | MLX (Python) | 1,764 | 235 | 62 | 3.8x

iPhone 17 Pro Max (MLX Swift)

Platform | Backend | PP512 (tok/s) | TG128 (tok/s) | 4-bit TG (tok/s) | Speedup
iPhone 17 Pro Max | MLX Swift | 1,456 | 103 | 60 | 1.7x
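The Speedup columns are simply ratios of the token-generation throughputs:

```python
# M4 Pro: 2-bit TG vs FP16 TG
m4_speedup = 235 / 62        # ~3.8x
# iPhone 17 Pro Max: 2-bit TG vs 4-bit TG
iphone_speedup = 103 / 60    # ~1.7x
```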

Benchmarks

Evaluated with EvalScope v1.4.2 + vLLM 0.15.1 on an NVIDIA H100. The full suite spans 10 benchmarks grouped into 6 categories; Avg is the mean of the six category scores:

Model | Size | Avg | MMLU-R | MuSR | IFEval | GSM8K | HE+ | BFCLv3
Ternary Bonsai 1.7B | 0.37 GB | 58.47 | 52.9 | 50.8 | 70.1 | 74.2 | 51.8 | 51.0
1-bit Bonsai 1.7B (prior) | 0.24 GB | 49.60 | 43.2 | 45.1 | 63.0 | 66.3 | 45.1 | 34.9
Qwen3 1.7B | 3.44 GB | 66.57 | 66.8 | 50.1 | 70.3 | 83.1 | 57.3 | 71.8
Qwen3 0.6B | 1.19 GB | 48.02 | 47.5 | 41.5 | 62.8 | 64.1 | 30.5 | 41.7
LFM2 1.2B | 2.34 GB | 46.73 | 52.9 | 25.4 | 77.5 | 62.2 | 36.0 | 26.4
Gemma3 1B | 2.00 GB | 45.53 | 43.2 | 37.0 | 61.9 | 64.4 | 40.2 | 26.5
Llama 3.2 1B | 2.47 GB | 39.88 | 47.2 | 29.2 | 47.7 | 49.0 | 35.4 | 30.8
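The Avg column is the mean of the six category scores; for Ternary Bonsai:

```python
scores = {"MMLU-R": 52.9, "MuSR": 50.8, "IFEval": 70.1,
          "GSM8K": 74.2, "HE+": 51.8, "BFCLv3": 51.0}
avg = sum(scores.values()) / len(scores)   # 350.8 / 6 = 58.47 as reported
```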

Intelligence Density

density = -ln(1 - score/100) / size_GB

Model | Size | Intelligence Density (1/GB)
Ternary Bonsai 1.7B | 0.37 GB | 2.389
1-bit Bonsai 1.7B (prior) | 0.24 GB | 2.832
Qwen3 0.6B | 1.19 GB | 0.549
Qwen3 1.7B | 3.44 GB | 0.318
Gemma3 1B | 2.00 GB | 0.304
LFM2 1.2B | 2.34 GB | 0.269
Llama 3.2 1B | 2.47 GB | 0.206
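Applying the formula to the table values reproduces the density column; small differences in the last digits come from the sizes shown here being rounded to two decimals:

```python
import math

def intelligence_density(score, size_gb):
    # -ln(1 - score/100) / size_GB: score gains near 100 count super-linearly
    return -math.log(1 - score / 100) / size_gb

for name, score, size in [("Ternary Bonsai 1.7B", 58.47, 0.37),
                          ("Qwen3 1.7B", 66.57, 3.44),
                          ("Qwen3 0.6B", 48.02, 1.19)]:
    print(f"{name}: {intelligence_density(score, size):.3f} 1/GB")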

Citation

@techreport{ternarybonsai,
    title   = {Ternary Bonsai: 1.58-bit Language Models at 8B, 4B, and 1.7B Scale},
    author  = {Prism ML},
    year    = {2026},
    month   = {April},
    url     = {https://prismml.com}
}

Contact

For questions, feedback, or collaboration inquiries: contact@prismml.com
