Bonsai

Prism ML Website  |  White Paper  |  Demo & Examples  |  Discord

Ternary-Bonsai-1.7B-mlx-2bit

Ternary (1.58-bit) language model for Apple Silicon

7.2x smaller than FP16 | 3.8x faster on M4 Pro | 103 tok/s on iPhone | runs on Mac, iPhone, iPad

Highlights

  • 0.45 GiB (0.48 GB) packed 2-bit size (down from 3.44 GB FP16) — fits anywhere
  • Ternary weights {-1, 0, +1} across embeddings, attention projections, MLP projections, and LM head
  • 58.47 avg benchmark score across 6 categories
  • 103 tok/s on iPhone 17 Pro Max
  • MLX-native format with group size 128 and FP16 scaling

Pareto Frontier

Resources

  • White Paper
  • Demo repo — examples for serving, benchmarking, and integrating Bonsai
  • Discord — community support and updates
  • Kernels: MLX (Apple Silicon) · mlx-swift (iOS/macOS) — 2-bit format is supported out of the box

Model Overview

Item | Specification
Base model | Qwen3-1.7B
Parameters | 1.72B
Architecture | GQA, SwiGLU MLP, RoPE, RMSNorm
Context length | 32,768 tokens
Vocab size | 151,936
Weight format | Ternary g128: {-1, 0, +1} with FP16 group-wise scaling
Packed 2-bit size | 0.45 GiB (0.48 GB)
Ternary coverage | Embeddings, attention projections, MLP projections, LM head
License | Apache 2.0

Quantization Format: Ternary g128

Each weight takes a value from {-1, 0, +1}, with one shared FP16 scale per group of 128 weights:

w_i = scale_g * t_i,    t_i in {-1, 0, +1}

The information-theoretic cost is log2(3) ≈ 1.585 bits per weight, plus FP16 group scales (16 bits per 128 weights), for a theoretical minimum of ~1.71 bits/weight. This release uses the MLX 2-bit format, which stores each ternary value in 2 bits plus group scales, for an effective ~2.125 bits/weight.
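The arithmetic above can be sketched in a few lines. The byte layout below is illustrative only, not the actual MLX on-disk format:

```python
import math

GROUP = 128  # group size: one FP16 scale per 128 weights

# Bits per weight, including the amortized FP16 scale (16 bits / 128 weights)
theoretical = math.log2(3) + 16 / GROUP   # ~1.585 + 0.125 = ~1.71
mlx_2bit = 2.0 + 16 / GROUP               # 2 + 0.125 = 2.125

# Illustrative 2-bit packing: map {-1, 0, +1} -> {0, 1, 2}, four values per byte.
def pack_ternary(ternary):
    codes = [t + 1 for t in ternary]
    out = bytearray()
    for i in range(0, len(codes), 4):
        chunk = codes[i:i + 4] + [0] * (4 - len(codes[i:i + 4]))
        out.append(chunk[0] | chunk[1] << 2 | chunk[2] << 4 | chunk[3] << 6)
    return bytes(out)

def unpack_ternary(packed, n):
    codes = []
    for b in packed:
        codes += [(b >> s) & 0b11 for s in (0, 2, 4, 6)]
    return [c - 1 for c in codes[:n]]

t = [-1, 0, 1, 1, -1, 0]
assert unpack_ternary(pack_ternary(t), len(t)) == t

# Dequantization: w_i = scale_g * t_i (0.042 is a made-up example scale)
scale = 0.042
w = [scale * ti for ti in t]
```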

Memory

Format | Size | Reduction | Ratio
FP16 | 3.44 GB | -- | 1.0x
MLX 2-bit g128 | 0.45 GiB (0.48 GB) | 86.0% | 7.2x
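A back-of-the-envelope check on the table: 1.72B parameters at an effective 2.125 bits/weight gives roughly 0.46 GB. The gap to the reported 0.48 GB is assumed (not stated in the card) to be FP16 norm weights and file metadata:

```python
params = 1.72e9
bits_per_weight = 2 + 16 / 128   # 2-bit codes + FP16 scale per 128 weights

est_bytes = params * bits_per_weight / 8
est_gb = est_bytes / 1e9         # decimal GB, ~0.46
est_gib = est_bytes / 2**30      # binary GiB, ~0.43

fp16_gb = params * 16 / 8 / 1e9  # ~3.44 GB at 16 bits/weight
ratio = fp16_gb / 0.48           # ~7.2x, matching the table
```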

Quickstart

MLX (Python)

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("prism-ml/Ternary-Bonsai-1.7B-mlx-2bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms.",
    max_tokens=256,
)
print(response)

Throughput (MLX / Apple Silicon)

Platform | Backend | PP512 (tok/s) | TG128 (tok/s) | FP16 TG (tok/s) | Speedup
M4 Pro 48 GB | MLX (Python) | 1,764 | 235 | 62 | 3.8x

iPhone 17 Pro Max (MLX Swift)

Platform | Backend | PP512 (tok/s) | TG128 (tok/s) | 4-bit TG (tok/s) | Speedup
iPhone 17 Pro Max | MLX Swift | 1,456 | 103 | 60 | 1.7x
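The Speedup columns are simply ratios of the token-generation throughputs:

```python
# M4 Pro: 2-bit TG vs FP16 TG
m4_speedup = 235 / 62        # ~3.8x
# iPhone 17 Pro Max: 2-bit TG vs 4-bit TG
iphone_speedup = 103 / 60    # ~1.7x
```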

Benchmarks

Evaluated with EvalScope v1.4.2 + vLLM 0.15.1 on an NVIDIA H100. The full suite spans 10 benchmarks grouped into 6 categories; Avg is the mean of the six category scores:

Model | Size | Avg | MMLU-R | MuSR | IFEval | GSM8K | HE+ | BFCLv3
Ternary Bonsai 1.7B | 0.37 GB | 58.47 | 52.9 | 50.8 | 70.1 | 74.2 | 51.8 | 51.0
1-bit Bonsai 1.7B (prior) | 0.24 GB | 49.60 | 43.2 | 45.1 | 63.0 | 66.3 | 45.1 | 34.9
Qwen3 1.7B | 3.44 GB | 66.57 | 66.8 | 50.1 | 70.3 | 83.1 | 57.3 | 71.8
Qwen3 0.6B | 1.19 GB | 48.02 | 47.5 | 41.5 | 62.8 | 64.1 | 30.5 | 41.7
LFM2 1.2B | 2.34 GB | 46.73 | 52.9 | 25.4 | 77.5 | 62.2 | 36.0 | 26.4
Gemma3 1B | 2.00 GB | 45.53 | 43.2 | 37.0 | 61.9 | 64.4 | 40.2 | 26.5
Llama 3.2 1B | 2.47 GB | 39.88 | 47.2 | 29.2 | 47.7 | 49.0 | 35.4 | 30.8
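The Avg column is the mean of the six category scores; for Ternary Bonsai:

```python
scores = {"MMLU-R": 52.9, "MuSR": 50.8, "IFEval": 70.1,
          "GSM8K": 74.2, "HE+": 51.8, "BFCLv3": 51.0}
avg = sum(scores.values()) / len(scores)   # 350.8 / 6 = 58.47 as reported
```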

Intelligence Density

density = -ln(1 - score/100) / size_GB

Model | Size | Intelligence Density (1/GB)
Ternary Bonsai 1.7B | 0.37 GB | 2.389
1-bit Bonsai 1.7B (prior) | 0.24 GB | 2.832
Qwen3 0.6B | 1.19 GB | 0.549
Qwen3 1.7B | 3.44 GB | 0.318
Gemma3 1B | 2.00 GB | 0.304
LFM2 1.2B | 2.34 GB | 0.269
Llama 3.2 1B | 2.47 GB | 0.206
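Applying the formula to the table values reproduces the density column; small differences in the last digits come from the sizes shown here being rounded to two decimals:

```python
import math

def intelligence_density(score, size_gb):
    # -ln(1 - score/100) / size_GB: score gains near 100 count super-linearly
    return -math.log(1 - score / 100) / size_gb

for name, score, size in [("Ternary Bonsai 1.7B", 58.47, 0.37),
                          ("Qwen3 1.7B", 66.57, 3.44),
                          ("Qwen3 0.6B", 48.02, 1.19)]:
    print(f"{name}: {intelligence_density(score, size):.3f} 1/GB")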

Citation

@techreport{ternarybonsai,
    title   = {Ternary Bonsai: 1.58-bit Language Models at 8B, 4B, and 1.7B Scale},
    author  = {Prism ML},
    year    = {2026},
    month   = {April},
    url     = {https://prismml.com}
}

Contact

For questions, feedback, or collaboration inquiries: contact@prismml.com
