Qwen3 1.7B Split-CoreML ANE Stack
This repository is not a vanilla Qwen3 checkpoint export.
It is a full-stack Core ML deployment of a Qwen3-1.7B-class decoder, re-authored around Apple runtime constraints: split MLPrograms, bounded state topology, reversed KV cache layout, mixed quantization, fp16 stabilizers, and an on-device sampler. The goal is not just "run Qwen3 in Core ML"; the goal is to make a Qwen3-shaped model survive ANE provisioning, hold a practical build-time context window, and generate on-device with a practical latency / memory profile.
This is also not only a build artifact repo. It is a deployment method: a specific way of making a Qwen3-class decoder fit Core ML graph constraints, ANE provisioning behavior, quantized fp16 numerics, and stateful autoregressive decode.
Deployment Envelope
The operational envelope is:
| Property | Value |
|---|---|
| Base model family | Qwen/Qwen3-1.7B |
| Transformer blocks | 28 |
| Shard split | default published layout is left 0-10, mid 11-19, right 20-27; builder also supports custom contiguous shard ranges |
| Hidden size | 2048 |
| Attention topology | GQA, 16 query heads / 8 KV heads |
| Default deployed context capacity | 4096 tokens |
| Context configurability | fixed at build time via SEQ_LEN, rebuild required to change it |
| Runtime batch size | fixed at 1 |
| Decode topology | stateful autoregressive decode with per-shard KV state |
| Sampling topology | separate sampler MLProgram, not host-only post-processing |
The context window matters here because this project is not shipping an abstract checkpoint. It ships concrete Core ML shard packages with a fixed sequence-state footprint. In other words, "context window" is a deployment contract, not just a config field.
What This Actually Is
The default published system is a coordinated set of Core ML MLPrograms:
| Component | Placement | Role |
|---|---|---|
| `io_model` | CPU + GPU | Shared embedding path and LM-head/logit projection |
| `decoder_shard_left` | CPU + ANE | Layers 0-10 with local KV state |
| `decoder_shard_mid` | CPU + ANE | Layers 11-19 with local KV state |
| `decoder_shard_right` | CPU + ANE | Layers 20-27 plus final norm |
| `sampler_model` | CPU + GPU | Temperature / min-p / repetition / coherence-aware sampling |
At runtime, the host loop does:
- Token embedding through `io_model`
- Hidden-state handoff through left -> mid -> right decoder shards
- Final logits through `io_model`
- Next-token selection through `sampler_model` or greedy decode
This is a split-graph autoregressive system, not a single monolithic model package.
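The handoff above can be sketched as a plain Python loop. Everything in this sketch is an illustrative stand-in (toy sizes, stub shard functions, greedy-only selection), not the repo's actual runtime API:

```python
import numpy as np

HIDDEN, VOCAB = 8, 16  # toy sizes; the real stack uses 2048 / 151,936

rng = np.random.default_rng(0)
W_embed = rng.standard_normal((VOCAB, HIDDEN)).astype(np.float32)

def embed(token_id):
    # stands in for io_model's embedding path: token id -> hidden state
    return W_embed[token_id]

def shard(h):
    # stands in for one decoder shard: hidden -> hidden
    # (the real shards are stateful, carrying their own KV cache)
    return np.tanh(h)

def logits(h):
    # stands in for io_model's logit projection (tied to the embedding)
    return W_embed @ h

def sample_greedy(z):
    # greedy stand-in for sampler_model
    return int(np.argmax(z))

def generate(prompt_ids, max_new_tokens):
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        h = embed(out[-1])
        for run_shard in (shard, shard, shard):  # left -> mid -> right
            h = run_shard(h)
        out.append(sample_greedy(logits(h)))
    return out

print(generate([3], max_new_tokens=4))
```

The real host loop differs mainly in that each shard call carries Core ML state objects and the sampler is itself an MLProgram rather than an `argmax`.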
Just as importantly, the repo is not limited to exactly one hardcoded three-way partition at build time. The generic shard builder supports:
- canonical presets for `left`, `mid`, and `right`
- custom contiguous layer ranges
- explicit shard roles: `entry`, `interior`, and `exit`
- special handling for entry-shard start constraints and exit-shard final-norm behavior
So the project has two different truths that both matter:
- the shipped runtime path in this repo is the canonical three-shard left/mid/right deployment
- the builder layer is more flexible than that and can emit other contiguous shard layouts
Why It Is Non-Vanilla
1. ANE-first graph partitioning
The central design problem is not only model size. It is the interaction between:
- state count and state footprint
- constant volume
- graph shape and dynamic ops
- dtype/op support on Core ML
- model-level compute placement
The default deployment is therefore split into three shard MLPrograms, each with its own KV state set, because ANE viability depends on the shape of the graph as much as on parameter count.
That said, the builder is generalized around shard roles rather than around the literal names "left", "mid", and "right". Entry and exit shards are special, but interior partitioning is not conceptually locked to a single middle slice.
2. Reverse ring-buffer KV cache
KV cache updates are not handled with a naive append/shift scheme.
This stack uses a reversed ring layout so that active context always lives in a contiguous suffix of the sequence axis. New K/V values are written by masked blending instead of scatter-heavy updates. That keeps the Core ML graph much friendlier to ANE provisioning than a more obvious dynamic-cache implementation.
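A minimal NumPy sketch of the masked-blend idea, assuming a sliding window over a fixed-length cache axis; the repo's actual reversed layout and per-shard state handling are more involved:

```python
import numpy as np

SEQ_LEN, D = 8, 4  # toy cache: sequence capacity x head dimension

def update_kv(cache, new_kv):
    """Shift the window one step and write new_kv into the last slot via a
    masked blend, so the active context stays a contiguous suffix of the
    sequence axis and no scatter op is needed."""
    # static-shape shift: drop the oldest row, duplicate the last row
    shifted = np.concatenate([cache[1:], cache[-1:]], axis=0)
    mask = np.zeros((SEQ_LEN, 1), dtype=cache.dtype)
    mask[-1] = 1.0
    # blend: keep shifted rows everywhere except the final slot
    return shifted * (1.0 - mask) + np.broadcast_to(new_kv, (SEQ_LEN, D)) * mask

cache = np.zeros((SEQ_LEN, D), dtype=np.float32)
for t in range(1, 4):
    cache = update_kv(cache, np.full((1, D), float(t), dtype=np.float32))

# After 3 writes the 3 newest vectors occupy the contiguous suffix:
print(cache[-3:, 0])  # [1. 2. 3.]
```

The point of expressing the update as concat + mask + multiply-add is that every op has a static shape, which is the property that keeps the graph ANE-friendly.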
3. Mixed quantization, not one-size-fits-all quantization
The model uses two different weight-compression regimes because the embedding/head and the projection stack behave differently.
- The shared embedding / LM head uses an OmniQuant-style blockwise weight-only path.
- Attention and MLP projections use GS128 grouped LUT quantization with per-group scalars.
- Bit allocation is mixed across layers and matrices rather than globally uniform.
- Q/K are treated more conservatively than easier matrices, and MLP blocks can tolerate lower precision in selected places.
This is not "quantize everything the same way and hope". The quantization regime is part of the architecture.
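To make the grouped-LUT idea concrete, here is a small NumPy sketch of per-group-scaled lookup-table quantization with group size 128. The 16-entry uniform table and the max-abs scale rule are illustrative assumptions, not the repo's learned LUTs or exact bit allocation:

```python
import numpy as np

GROUP = 128  # GS128: one scalar scale per group of 128 weights
# illustrative 16-entry table of normalized levels (stand-in for a learned LUT)
LUT = np.linspace(-1.0, 1.0, 16).astype(np.float32)

def quantize_gs(w):
    """Quantize a weight matrix to LUT indices + per-group scales."""
    g = w.reshape(-1, GROUP)
    scales = np.abs(g).max(axis=1, keepdims=True) + 1e-8  # per-group scalar
    normed = g / scales                                   # values in [-1, 1]
    # nearest LUT entry per weight
    idx = np.abs(normed[..., None] - LUT).argmin(axis=-1).astype(np.uint8)
    return idx, scales

def dequantize_gs(idx, scales, shape):
    return (LUT[idx] * scales).reshape(shape).astype(np.float32)

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 256)).astype(np.float32)
idx, scales = quantize_gs(w)
w_hat = dequantize_gs(idx, scales, w.shape)
print(float(np.abs(w - w_hat).max()))  # small reconstruction error
```

Mixed bit allocation then amounts to choosing a different table size (and therefore index width) per layer or matrix, which is why Q/K can be kept conservative while selected MLP matrices drop lower.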
4. Numerical stabilizers for fp16 Core ML execution
The stack adds several stabilizers that are essential under aggressive quantization:
- safe dynamic RMSNorm
- SLaNC pre-scales for hidden/Q/K paths
- static RoPE tables
- static causal mask tables
- explicit fp16-friendly graph structure around attention and MLP paths
These pieces exist because naive fp16 Core ML execution degrades quickly under heavy quantization and long-context decode.
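As one example of what "safe" means here, a sketch of an RMSNorm whose reduction runs at fp32 before casting back to fp16. This is the standard higher-precision-reduction trick; the repo's exact stabilizer formulation may differ:

```python
import numpy as np

def safe_rmsnorm_fp16(x_fp16, weight_fp16, eps=1e-5):
    """RMSNorm with the sum-of-squares done in fp32, then cast back to fp16.
    A naive fp16 reduction overflows for long vectors with large activations
    (fp16 max is 65504)."""
    x = x_fp16.astype(np.float32)
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms * weight_fp16.astype(np.float32)).astype(np.float16)

# activations that would overflow a naive fp16 sum of squares:
x = np.full(2048, 200.0, dtype=np.float16)
w = np.ones(2048, dtype=np.float16)
y = safe_rmsnorm_fp16(x, w)
print(np.isfinite(y).all())  # True
```

The static RoPE and causal-mask tables serve the same goal from the other direction: anything that can be precomputed at build time is baked in as a constant rather than recomputed in fp16 on-device.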
5. On-device sampler as a first-class model
Sampling is not treated as an afterthought on the host side.
The repo includes a dedicated sampler MLProgram with:
- temperature
- min-p pruning
- repetition penalty
- coherence modulation
- noise injection for stochastic decode
That keeps the decode loop on-device and makes the sampler part of the deployment design rather than post-processing glue.
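A reference-style sketch of the first three sampler stages (temperature, repetition penalty, min-p pruning) in NumPy. Names, defaults, and the penalty rule are illustrative; coherence modulation and noise injection are omitted:

```python
import numpy as np

def sample(logits, history, temperature=0.8, min_p=0.05,
           rep_penalty=1.1, rng=None):
    """Temperature + repetition penalty + min-p sampling (illustrative)."""
    z = logits.astype(np.float64).copy()
    # repetition penalty: push previously emitted tokens toward zero logit
    for t in set(history):
        z[t] = z[t] / rep_penalty if z[t] > 0 else z[t] * rep_penalty
    z = z / temperature
    p = np.exp(z - z.max())
    p = p / p.sum()
    # min-p: prune tokens whose probability is below a fraction of the top token
    p[p < min_p * p.max()] = 0.0
    p = p / p.sum()
    rng = rng or np.random.default_rng(0)
    return int(rng.choice(len(p), p=p))

tok = sample(np.array([2.0, 1.0, 0.5, -3.0]), history=[0])
print(tok)
```

Expressing this pipeline as tensor ops is what lets it compile into its own MLProgram instead of living as Python post-processing on the host.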
Context Window and State Topology
One thing that should not be buried in the README is the sequence contract.
- The default build emits shard packages with `SEQ_LEN=4096`.
- That `4096` is the deployed Core ML context capacity, not the upstream Qwen3 theoretical maximum.
- The decoder builders can read upstream `max_position_embeddings`, `seq_length`, or `context_length`, but the local stack script pins `SEQ_LEN=4096` unless overridden.
- Batch is fixed at `1`, so the runtime is optimized for one stateful stream rather than multi-request batching.
- Each shard owns its own KV state tensors for its layer range, which is one reason the context window is a real deployment-budget decision instead of an incidental config knob.
Changing the context window is therefore a rebuild operation, not a runtime flag flip. If you want 8192 or another capacity, you regenerate the Core ML packages with a different SEQ_LEN and accept the corresponding state / constant tradeoffs.
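In practice that rebuild is just the stack script with `SEQ_LEN` overridden (illustrative invocation using the environment variables documented below):

```shell
SEQ_LEN=8192 PYTHON_BIN=.venv/bin/python bash tools/build/build_coreml_stack.sh
```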
Shard Flexibility
The builder surface is more flexible than the runtime surface.
- `tools/build/build_decoder_shard.py` supports canonical presets for `left`, `mid`, and `right`.
- The same entrypoint also supports custom shards via `--layer-start`, `--layer-end`, `--role`, and final-norm selection.
- Entry shards are special because they must start at layer `0`.
- Exit shards are special because they must end at the final decoder layer and apply final norm.
- Interior shards can cover other contiguous layer ranges and are not conceptually restricted to one specific "mid" definition.
The current runtime and packaged release path still assume the canonical three-package deployment:
- `decoder_shard_left`
- `decoder_shard_mid`
- `decoder_shard_right`
So the honest description is: the build system is generalized; the public runtime path in this repo is still the canonical three-shard stack.
Model Shape
This stack targets a Qwen3-1.7B-class decoder with:
- 28 transformer blocks
- hidden size 2048
- grouped query attention with 16 query heads and 8 KV heads
- RoPE base `1e6`
- vocabulary size `151,936`
- shared embedding / LM-head weights at the logical level
- default deployed context capacity of `4096` tokens
- build-time-fixed batch size `1`
What is preserved:
- decoder semantics
- grouped-query attention structure
- RoPE-based position encoding
- tied embedding/logit behavior
What is intentionally not preserved:
- upstream storage format
- upstream KV-cache implementation
- upstream sampler behavior
- upstream single-graph deployment assumptions
Repository Layout
The important directories are:
- `src/qwen3_coreml/`: Core runtime, builders, quantization helpers, contracts, and sampler reference logic.
- `tools/build/`: Public build entrypoints for the Core ML packages.
- `tools/run/`: Runtime CLI for prefill + generation with the split model stack.
- `assets/upstream/qwen3_1_7b/`: Upstream Hugging Face config/tokenizer/weights.
- `assets/quantization/`: Quantized embedding and projection payloads used by the builders.
- `artifacts/coreml/packages/`: Generated `.mlpackage` outputs.
- `artifacts/coreml/mil/`: Generated MIL dumps for inspection.
- `artifacts/numerics/`: Generated SLaNC scale files.
This README intentionally focuses on the public build/run path. Local validation and one-off investigative tooling are not the primary public surface.
Build Inputs
The build expects three classes of inputs:
Upstream Qwen3 assets
- `config.json`
- tokenizer JSON
- Hugging Face safetensors
Quantized payloads
- OmniQuant embedding/LM-head weights and scales
- GS128 LUT projection packs and metadata
Local build output area
- `artifacts/` for scales, MIL dumps, temp packages, and final Core ML bundles
The decoder build path now supports either:
- a consolidated `model.safetensors`
- or Hugging Face sharded weights via `model.safetensors.index.json`
So a local full merge is no longer required before building.
Build Outputs
The normal build emits:
- `artifacts/coreml/packages/io_model.mlpackage`
- `artifacts/coreml/packages/decoder_shard_left.mlpackage`
- `artifacts/coreml/packages/decoder_shard_mid.mlpackage`
- `artifacts/coreml/packages/decoder_shard_right.mlpackage`
- `artifacts/coreml/packages/sampler_model.mlpackage`
- `artifacts/numerics/slanc_scales.npy`
- MIL dumps under `artifacts/coreml/mil/`
Build
The simplest build path is the stack script:
```shell
bash tools/build/build_coreml_stack.sh
```
In practice you will usually run:
```shell
PYTHON_BIN=.venv/bin/python bash tools/build/build_coreml_stack.sh
```
Key environment variables:
`CONFIG`, `WEIGHTS`, `WEIGHTS_INDEX`, `WEIGHTS_GLOB`, `LUT_DIR`, `OMNIQ_DIR`, `SCALES`, `OUT_DIR`, `MIL_DIR`, `TMP_DIR`, `SEQ_LEN`, `PALETTIZE_MASKS`
Default stack behavior:
- computes SLaNC scales
- builds `io_model`
- builds the canonical left / mid / right decoder shards
- palettizes KV/mask constants in shard packages
- builds the sampler model
By default the stack builds with `SEQ_LEN=4096`. That is the deployed Core ML context capacity for the generated shard packages, not the upstream Qwen3 theoretical maximum context.
If you want a different shard partition, use the generic shard builder directly instead of the convenience stack script.
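For example, an interior shard covering a custom layer range might be built directly with the flags documented above. This invocation is illustrative; the remaining required arguments (config, weights, and output paths) are elided:

```shell
python tools/build/build_decoder_shard.py \
  --layer-start 11 \
  --layer-end 15 \
  --role interior \
  ...
```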
Run
Example:
```shell
python tools/run/run_inference.py \
  --models-dir artifacts/coreml/packages \
  --tokenizer assets/upstream/qwen3_1_7b/tokenizer.json \
  --prompt "What is RSA?" \
  --max-new-tokens 80 \
  --decode-mode greedy \
  --stream-mode word \
  --stats
```
Useful runtime notes:
- decoder shards default to `CPU_AND_NE`
- `io_model` and `sampler_model` default to `CPU_AND_GPU`
- no visible output appears until prompt prefill completes
- `--prefill-progress` is useful for long prompts
- batch size is fixed at `1`
- generation can expose reasoning tokens such as `<think>` if the model emits them
The first ANE-oriented run may spend a long time in Core ML / ANE compilation before token output starts. That is expected behavior for this stack and should not be confused with a dead process.
Perplexity Parity Results
This project is not only architecturally unusual; it has also been checked against the upstream Hugging Face safetensors baseline on streamed wikitext-103-raw-v1 windows.
The comparison setup below used:
- upstream Hugging Face safetensors as the Torch baseline
- the split Core ML shard stack on the ANE-oriented `default` path
- streamed, contiguous windows from `wikitext-103-raw-v1`
- identical tokenizer and prompt-token stream on both sides
| Streamed sample | Torch PPL | Split Core ML PPL | Delta |
|---|---|---|---|
| 24 x 256 tokens | 30.6918 | 30.7140 | +0.0222 |
| 24 x 1024 tokens | 17.8666 | 17.3794 | -0.4873 |
| 24 x 2048 tokens | 16.8072 | 16.2345 | -0.5727 |
Benchmark artifacts:
- `artifacts/ppl_compare_wikitext103raw_stream_24w_256.json`
- `artifacts/ppl_compare_wikitext103raw_stream_24w_1024.json`
- `artifacts/ppl_compare_wikitext103raw_stream_24w_2048.json`
These numbers matter because they show that the split ANE-targeted runtime is not merely "working" in the sense of producing tokens. It stays close to the upstream Torch baseline under streamed long-window evaluation, despite graph partitioning, custom KV-state topology, mixed quantization, and fp16 Core ML execution.
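The metric itself is simple to state. A sketch of streamed-window perplexity, assuming per-token log-probabilities are available from each side (window handling is illustrative):

```python
import numpy as np

def windowed_ppl(token_logprobs, window=256, n_windows=24):
    """Perplexity over streamed contiguous windows: exp of the negative mean
    per-token log-probability across all evaluated tokens."""
    lp = np.asarray(token_logprobs[: window * n_windows], dtype=np.float64)
    return float(np.exp(-lp.mean()))

# sanity check: if every token had probability 1/8, PPL is exactly 8
lp = np.full(24 * 256, np.log(1 / 8))
print(round(windowed_ppl(lp), 6))  # 8.0
```

Because both sides consume the identical token stream, any PPL gap in the table above is attributable to the Core ML graph, quantization, and fp16 numerics rather than to tokenization differences.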
What This Repository Is Not
It is not:
- a plain Hugging Face Transformers checkpoint repo
- a single exported Core ML graph
- a custom ANE kernel project
- a generic "quantized Qwen" dump with minimal glue
The value of the project is in the interaction between graph partitioning, state engineering, quantization, numerical stabilization, and sampler design.
If You Publish This On Hugging Face
The natural release payload is:
- the five generated `.mlpackage` bundles, usually zipped for distribution
- `tokenizer.json`
- `config.json`
- `generation_config.json`
- a model card explaining the split runtime and non-vanilla methods
- the appropriate upstream license / notice files if upstream assets are redistributed
If the goal is a clean HF model repo, do not upload local experiment clutter, local virtual environments, temporary build junk, or unrelated validation workspaces. Publish the runnable Core ML surface, not the entire machine state that produced it.
License
The repository code and build logic are under Apache 2.0. See LICENSE.
Upstream Qwen assets remain under their own license terms. If you redistribute upstream weights, tokenizer files, or config files, include the upstream license / notice material and follow the upstream redistribution terms.