Qwen3 1.7B Split-CoreML ANE Stack
This repository is not a vanilla Qwen3 checkpoint export.
It is a full-stack Core ML deployment of a Qwen3-1.7B-class decoder, re-authored around Apple runtime constraints: split MLPrograms, bounded state topology, reversed KV cache layout, mixed quantization, fp16 stabilizers, and an on-device sampler. The goal is not just "run Qwen3 in Core ML"; the goal is to make a Qwen3-shaped model survive ANE provisioning, hold a practical build-time context window, and generate on-device with a practical latency / memory profile.
This is also not only a build artifact repo. It is a deployment method: a specific way of making a Qwen3-class decoder fit Core ML graph constraints, ANE provisioning behavior, quantized fp16 numerics, and stateful autoregressive decode.
Deployment Envelope
The operational envelope is:
| Property | Value |
|---|---|
| Base model family | Qwen/Qwen3-1.7B |
| Transformer blocks | 28 |
| Shard split | default published layout is left 0-10, mid 11-19, right 20-27; builder also supports custom contiguous shard ranges |
| Hidden size | 2048 |
| Attention topology | GQA, 16 query heads / 8 KV heads |
| Default deployed context capacity | 4096 tokens |
| Context configurability | fixed at build time via SEQ_LEN, rebuild required to change it |
| Runtime batch size | fixed at 1 |
| Decode topology | stateful autoregressive decode with per-shard KV state |
| Sampling topology | separate sampler MLProgram, not host-only post-processing |
The context window matters here because this project is not shipping an abstract checkpoint. It ships concrete Core ML shard packages with a fixed sequence-state footprint. In other words, "context window" is a deployment contract, not just a config field.
What This Actually Is
The default published system is a coordinated set of Core ML MLPrograms:
| Component | Placement | Role |
|---|---|---|
| `io_model` | CPU + GPU | Shared embedding path and LM-head/logit projection |
| `decoder_shard_left` | CPU + ANE | Layers 0-10 with local KV state |
| `decoder_shard_mid` | CPU + ANE | Layers 11-19 with local KV state |
| `decoder_shard_right` | CPU + ANE | Layers 20-27 plus final norm |
| `sampler_model` | CPU + GPU | Temperature / min-p / repetition / coherence-aware sampling |
At runtime, the host loop does:
- Token embedding through `io_model`
- Hidden-state handoff through left -> mid -> right decoder shards
- Final logits through `io_model`
- Next-token selection through `sampler_model` or greedy decode
This is a split-graph autoregressive system, not a single monolithic model package.
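The handoff above can be sketched as a plain Python loop. Everything in this sketch is an illustrative stand-in (toy sizes, stub shard functions, greedy-only selection), not the repo's actual runtime API:

```python
import numpy as np

HIDDEN, VOCAB = 8, 16  # toy sizes; the real stack uses 2048 / 151,936

rng = np.random.default_rng(0)
W_embed = rng.standard_normal((VOCAB, HIDDEN)).astype(np.float32)

def embed(token_id):
    # stands in for io_model's embedding path: token id -> hidden state
    return W_embed[token_id]

def shard(h):
    # stands in for one decoder shard: hidden -> hidden
    # (the real shards are stateful, carrying their own KV cache)
    return np.tanh(h)

def logits(h):
    # stands in for io_model's logit projection (tied to the embedding)
    return W_embed @ h

def sample_greedy(z):
    # greedy stand-in for sampler_model
    return int(np.argmax(z))

def generate(prompt_ids, max_new_tokens):
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        h = embed(out[-1])
        for run_shard in (shard, shard, shard):  # left -> mid -> right
            h = run_shard(h)
        out.append(sample_greedy(logits(h)))
    return out

print(generate([3], max_new_tokens=4))
```

The real host loop differs mainly in that each shard call carries Core ML state objects and the sampler is itself an MLProgram rather than an `argmax`.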
Just as importantly, the repo is not limited to exactly one hardcoded three-way partition at build time. The generic shard builder supports:
- canonical presets for `left`, `mid`, and `right`
- custom contiguous layer ranges
- explicit shard roles: `entry`, `interior`, and `exit`
- special handling for entry-shard start constraints and exit-shard final-norm behavior
So the project has two different truths that both matter:
- the shipped runtime path in this repo is the canonical three-shard left/mid/right deployment
- the builder layer is more flexible than that and can emit other contiguous shard layouts
Why It Is Non-Vanilla
1. ANE-first graph partitioning
The central design problem is not only model size. It is the interaction between:
- state count and state footprint
- constant volume
- graph shape and dynamic ops
- dtype/op support on Core ML
- model-level compute placement
The default deployment is therefore split into three shard MLPrograms, each with its own KV state set, because ANE viability depends on the shape of the graph as much as on parameter count.
That said, the builder is generalized around shard roles rather than around the literal names "left", "mid", and "right". Entry and exit shards are special, but interior partitioning is not conceptually locked to a single middle slice.
2. Reverse ring-buffer KV cache
KV cache updates are not handled with a naive append/shift scheme.
This stack uses a reversed ring layout so that active context always lives in a contiguous suffix of the sequence axis. New K/V values are written by masked blending instead of scatter-heavy updates. That keeps the Core ML graph much friendlier to ANE provisioning than a more obvious dynamic-cache implementation.
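A minimal NumPy sketch of the masked-blend idea, assuming a sliding window over a fixed-length cache axis; the repo's actual reversed layout and per-shard state handling are more involved:

```python
import numpy as np

SEQ_LEN, D = 8, 4  # toy cache: sequence capacity x head dimension

def update_kv(cache, new_kv):
    """Shift the window one step and write new_kv into the last slot via a
    masked blend, so the active context stays a contiguous suffix of the
    sequence axis and no scatter op is needed."""
    # static-shape shift: drop the oldest row, duplicate the last row
    shifted = np.concatenate([cache[1:], cache[-1:]], axis=0)
    mask = np.zeros((SEQ_LEN, 1), dtype=cache.dtype)
    mask[-1] = 1.0
    # blend: keep shifted rows everywhere except the final slot
    return shifted * (1.0 - mask) + np.broadcast_to(new_kv, (SEQ_LEN, D)) * mask

cache = np.zeros((SEQ_LEN, D), dtype=np.float32)
for t in range(1, 4):
    cache = update_kv(cache, np.full((1, D), float(t), dtype=np.float32))

# After 3 writes the 3 newest vectors occupy the contiguous suffix:
print(cache[-3:, 0])  # [1. 2. 3.]
```

The point of expressing the update as concat + mask + multiply-add is that every op has a static shape, which is the property that keeps the graph ANE-friendly.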
3. Mixed quantization, not one-size-fits-all quantization
The model uses two different weight-compression regimes because the embedding/head and the projection stack behave differently.
- The shared embedding / LM head uses an OmniQuant-style blockwise weight-only path.
- Attention and MLP projections use GS128 grouped LUT quantization with per-group scalars.
- Bit allocation is mixed across layers and matrices rather than globally uniform.
- Q/K are treated more conservatively than easier matrices, and MLP blocks can tolerate lower precision in selected places.
This is not "quantize everything the same way and hope". The quantization regime is part of the architecture.
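To make the grouped-LUT idea concrete, here is a small NumPy sketch of per-group-scaled lookup-table quantization with group size 128. The 16-entry uniform table and the max-abs scale rule are illustrative assumptions, not the repo's learned LUTs or exact bit allocation:

```python
import numpy as np

GROUP = 128  # GS128: one scalar scale per group of 128 weights
# illustrative 16-entry table of normalized levels (stand-in for a learned LUT)
LUT = np.linspace(-1.0, 1.0, 16).astype(np.float32)

def quantize_gs(w):
    """Quantize a weight matrix to LUT indices + per-group scales."""
    g = w.reshape(-1, GROUP)
    scales = np.abs(g).max(axis=1, keepdims=True) + 1e-8  # per-group scalar
    normed = g / scales                                   # values in [-1, 1]
    # nearest LUT entry per weight
    idx = np.abs(normed[..., None] - LUT).argmin(axis=-1).astype(np.uint8)
    return idx, scales

def dequantize_gs(idx, scales, shape):
    return (LUT[idx] * scales).reshape(shape).astype(np.float32)

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 256)).astype(np.float32)
idx, scales = quantize_gs(w)
w_hat = dequantize_gs(idx, scales, w.shape)
print(float(np.abs(w - w_hat).max()))  # small reconstruction error
```

Mixed bit allocation then amounts to choosing a different table size (and therefore index width) per layer or matrix, which is why Q/K can be kept conservative while selected MLP matrices drop lower.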
4. Numerical stabilizers for fp16 Core ML execution
The stack adds several stabilizers that are essential under aggressive quantization:
- safe dynamic RMSNorm
- SLaNC pre-scales for hidden/Q/K paths
- static RoPE tables
- static causal mask tables
- explicit fp16-friendly graph structure around attention and MLP paths
These pieces exist because naive fp16 Core ML execution degrades quickly under heavy quantization and long-context decode.
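As one example of what "safe" means here, a sketch of an RMSNorm whose reduction runs at fp32 before casting back to fp16. This is the standard higher-precision-reduction trick; the repo's exact stabilizer formulation may differ:

```python
import numpy as np

def safe_rmsnorm_fp16(x_fp16, weight_fp16, eps=1e-5):
    """RMSNorm with the sum-of-squares done in fp32, then cast back to fp16.
    A naive fp16 reduction overflows for long vectors with large activations
    (fp16 max is 65504)."""
    x = x_fp16.astype(np.float32)
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms * weight_fp16.astype(np.float32)).astype(np.float16)

# activations that would overflow a naive fp16 sum of squares:
x = np.full(2048, 200.0, dtype=np.float16)
w = np.ones(2048, dtype=np.float16)
y = safe_rmsnorm_fp16(x, w)
print(np.isfinite(y).all())  # True
```

The static RoPE and causal-mask tables serve the same goal from the other direction: anything that can be precomputed at build time is baked in as a constant rather than recomputed in fp16 on-device.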
5. On-device sampler as a first-class model
Sampling is not treated as an afterthought on the host side.
The repo includes a dedicated sampler MLProgram with:
- temperature
- min-p pruning
- repetition penalty
- coherence modulation
- noise injection for stochastic decode
That keeps the decode loop on-device and makes the sampler part of the deployment design rather than post-processing glue.
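A reference-style sketch of the first three sampler stages (temperature, repetition penalty, min-p pruning) in NumPy. Names, defaults, and the penalty rule are illustrative; coherence modulation and noise injection are omitted:

```python
import numpy as np

def sample(logits, history, temperature=0.8, min_p=0.05,
           rep_penalty=1.1, rng=None):
    """Temperature + repetition penalty + min-p sampling (illustrative)."""
    z = logits.astype(np.float64).copy()
    # repetition penalty: push previously emitted tokens toward zero logit
    for t in set(history):
        z[t] = z[t] / rep_penalty if z[t] > 0 else z[t] * rep_penalty
    z = z / temperature
    p = np.exp(z - z.max())
    p = p / p.sum()
    # min-p: prune tokens whose probability is below a fraction of the top token
    p[p < min_p * p.max()] = 0.0
    p = p / p.sum()
    rng = rng or np.random.default_rng(0)
    return int(rng.choice(len(p), p=p))

tok = sample(np.array([2.0, 1.0, 0.5, -3.0]), history=[0])
print(tok)
```

Expressing this pipeline as tensor ops is what lets it compile into its own MLProgram instead of living as Python post-processing on the host.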
Context Window and State Topology
One thing that should not be buried in the README is the sequence contract.
- The default build emits shard packages with `SEQ_LEN=4096`.
- That `4096` is the deployed Core ML context capacity, not the upstream Qwen3 theoretical maximum.
- The decoder builders can read upstream `max_position_embeddings`, `seq_length`, or `context_length`, but the local stack script pins `SEQ_LEN=4096` unless overridden.
- Batch is fixed at `1`, so the runtime is optimized for one stateful stream rather than multi-request batching.
- Each shard owns its own KV state tensors for its layer range, which is one reason the context window is a real deployment-budget decision instead of an incidental config knob.
Changing the context window is therefore a rebuild operation, not a runtime flag flip. If you want 8192 or another capacity, you regenerate the Core ML packages with a different SEQ_LEN and accept the corresponding state / constant tradeoffs.
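In practice that rebuild is just the stack script with `SEQ_LEN` overridden (illustrative invocation using the environment variables documented below):

```shell
SEQ_LEN=8192 PYTHON_BIN=.venv/bin/python bash tools/build/build_coreml_stack.sh
```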
Shard Flexibility
The builder surface is more flexible than the runtime surface.
- `tools/build/build_decoder_shard.py` supports canonical presets for `left`, `mid`, and `right`.
- The same entrypoint also supports custom shards via `--layer-start`, `--layer-end`, `--role`, and final-norm selection.
- Entry shards are special because they must start at layer `0`.
- Exit shards are special because they must end at the final decoder layer and apply final norm.
- Interior shards can cover other contiguous layer ranges and are not conceptually restricted to one specific "mid" definition.
The current runtime and packaged release path still assume the canonical three-package deployment:
- `decoder_shard_left`
- `decoder_shard_mid`
- `decoder_shard_right`
So the honest description is: the build system is generalized; the public runtime path in this repo is still the canonical three-shard stack.
Model Shape
This stack targets a Qwen3-1.7B-class decoder with:
- 28 transformer blocks
- hidden size 2048
- grouped query attention with 16 query heads and 8 KV heads
- RoPE base `1e6`
- vocabulary size `151,936`
- shared embedding / LM-head weights at the logical level
- default deployed context capacity of `4096` tokens
- build-time-fixed batch size `1`
What is preserved:
- decoder semantics
- grouped-query attention structure
- RoPE-based position encoding
- tied embedding/logit behavior
What is intentionally not preserved:
- upstream storage format
- upstream KV-cache implementation
- upstream sampler behavior
- upstream single-graph deployment assumptions
Repository Layout
The important directories are:
- `src/qwen3_coreml/`: Core runtime, builders, quantization helpers, contracts, and sampler reference logic.
- `tools/build/`: Public build entrypoints for the Core ML packages.
- `tools/run/`: Runtime CLI for prefill + generation with the split model stack.
- `assets/upstream/qwen3_1_7b/`: Upstream Hugging Face config/tokenizer/weights.
- `assets/quantization/`: Quantized embedding and projection payloads used by the builders.
- `artifacts/coreml/packages/`: Generated `.mlpackage` outputs.
- `artifacts/coreml/mil/`: Generated MIL dumps for inspection.
- `artifacts/numerics/`: Generated SLaNC scale files.
This README intentionally focuses on the public build/run path. Local validation and one-off investigative tooling are not the primary public surface.
Build Inputs
The build expects three classes of inputs:
Upstream Qwen3 assets
- `config.json`
- tokenizer JSON
- Hugging Face safetensors
Quantized payloads
- OmniQuant embedding/LM-head weights and scales
- GS128 LUT projection packs and metadata
Local build output area
- `artifacts/` for scales, MIL dumps, temp packages, and final Core ML bundles
The decoder build path now supports either:
- a consolidated `model.safetensors`
- or Hugging Face sharded weights via `model.safetensors.index.json`
So a local full merge is no longer required before building.
Build Outputs
The normal build emits:
- `artifacts/coreml/packages/io_model.mlpackage`
- `artifacts/coreml/packages/decoder_shard_left.mlpackage`
- `artifacts/coreml/packages/decoder_shard_mid.mlpackage`
- `artifacts/coreml/packages/decoder_shard_right.mlpackage`
- `artifacts/coreml/packages/sampler_model.mlpackage`
- `artifacts/numerics/slanc_scales.npy`
- MIL dumps under `artifacts/coreml/mil/`
Build
The simplest build path is the stack script:
```shell
bash tools/build/build_coreml_stack.sh
```
In practice you will usually run:
```shell
PYTHON_BIN=.venv/bin/python bash tools/build/build_coreml_stack.sh
```
Key environment variables:
`CONFIG`, `WEIGHTS`, `WEIGHTS_INDEX`, `WEIGHTS_GLOB`, `LUT_DIR`, `OMNIQ_DIR`, `SCALES`, `OUT_DIR`, `MIL_DIR`, `TMP_DIR`, `SEQ_LEN`, `PALETTIZE_MASKS`
Default stack behavior:
- computes SLaNC scales
- builds `io_model`
- builds the canonical left / mid / right decoder shards
- palettizes KV/mask constants in shard packages
- builds the sampler model
By default the stack builds with `SEQ_LEN=4096`. That is the deployed Core ML context capacity for the generated shard packages, not the upstream Qwen3 theoretical maximum context.
If you want a different shard partition, use the generic shard builder directly instead of the convenience stack script.
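For example, an interior shard covering a custom layer range might be built directly with the flags documented above. This invocation is illustrative; the remaining required arguments (config, weights, and output paths) are elided:

```shell
python tools/build/build_decoder_shard.py \
  --layer-start 11 \
  --layer-end 15 \
  --role interior \
  ...
```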
Run
Example:
```shell
python tools/run/run_inference.py \
  --models-dir artifacts/coreml/packages \
  --tokenizer assets/upstream/qwen3_1_7b/tokenizer.json \
  --prompt "What is RSA?" \
  --max-new-tokens 80 \
  --decode-mode greedy \
  --stream-mode word \
  --stats
```
Useful runtime notes:
- decoder shards default to `CPU_AND_NE`
- `io_model` and `sampler_model` default to `CPU_AND_GPU`
- no visible output appears until prompt prefill completes
- `--prefill-progress` is useful for long prompts
- batch size is fixed at `1`
- generation can expose reasoning tokens such as `<think>` if the model emits them
The first ANE-oriented run may spend a long time in Core ML / ANE compilation before token output starts. That is expected behavior for this stack and should not be confused with a dead process.
Perplexity Parity Results
This project is not only architecturally unusual; it has also been checked against the upstream Hugging Face safetensors baseline on streamed wikitext-103-raw-v1 windows.
The comparison setup below used:
- upstream Hugging Face safetensors as the Torch baseline
- the split Core ML shard stack on the ANE-oriented `default` path
- streamed, contiguous windows from `wikitext-103-raw-v1`
- identical tokenizer and prompt-token stream on both sides
| Streamed sample | Torch PPL | Split Core ML PPL | Delta |
|---|---|---|---|
| 24 x 256 tokens | 30.6918 | 30.7140 | +0.0222 |
| 24 x 1024 tokens | 17.8666 | 17.3794 | -0.4873 |
| 24 x 2048 tokens | 16.8072 | 16.2345 | -0.5727 |
Benchmark artifacts:
- `artifacts/ppl_compare_wikitext103raw_stream_24w_256.json`
- `artifacts/ppl_compare_wikitext103raw_stream_24w_1024.json`
- `artifacts/ppl_compare_wikitext103raw_stream_24w_2048.json`
These numbers matter because they show that the split ANE-targeted runtime is not merely "working" in the sense of producing tokens. It stays close to the upstream Torch baseline under streamed long-window evaluation, despite graph partitioning, custom KV-state topology, mixed quantization, and fp16 Core ML execution.
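The metric itself is simple to state. A sketch of streamed-window perplexity, assuming per-token log-probabilities are available from each side (window handling is illustrative):

```python
import numpy as np

def windowed_ppl(token_logprobs, window=256, n_windows=24):
    """Perplexity over streamed contiguous windows: exp of the negative mean
    per-token log-probability across all evaluated tokens."""
    lp = np.asarray(token_logprobs[: window * n_windows], dtype=np.float64)
    return float(np.exp(-lp.mean()))

# sanity check: if every token had probability 1/8, PPL is exactly 8
lp = np.full(24 * 256, np.log(1 / 8))
print(round(windowed_ppl(lp), 6))  # 8.0
```

Because both sides consume the identical token stream, any PPL gap in the table above is attributable to the Core ML graph, quantization, and fp16 numerics rather than to tokenization differences.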
What This Repository Is Not
It is not:
- a plain Hugging Face Transformers checkpoint repo
- a single exported Core ML graph
- a custom ANE kernel project
- a generic "quantized Qwen" dump with minimal glue
The value of the project is in the interaction between graph partitioning, state engineering, quantization, numerical stabilization, and sampler design.
If You Publish This On Hugging Face
The natural release payload is:
- the five generated `.mlpackage` bundles, usually zipped for distribution
- `tokenizer.json`
- `config.json`
- `generation_config.json`
- a model card explaining the split runtime and non-vanilla methods
- the appropriate upstream license / notice files if upstream assets are redistributed
If the goal is a clean HF model repo, do not upload local experiment clutter, local virtual environments, temporary build junk, or unrelated validation workspaces. Publish the runnable Core ML surface, not the entire machine state that produced it.
License
The repository code and build logic are under Apache 2.0. See LICENSE.
Upstream Qwen assets remain under their own license terms. If you redistribute upstream weights, tokenizer files, or config files, include the upstream license / notice material and follow the upstream redistribution terms.