Qwen3 1.7B Split-CoreML ANE Stack

This repository is not a vanilla Qwen3 checkpoint export.

It is a full-stack Core ML deployment of a Qwen3-1.7B-class decoder, re-authored around Apple runtime constraints: split MLPrograms, bounded state topology, reversed KV cache layout, mixed quantization, fp16 stabilizers, and an on-device sampler. The goal is not just "run Qwen3 in Core ML"; the goal is to make a Qwen3-shaped model survive ANE provisioning, hold a usable build-time context window, and generate on-device with a practical latency / memory profile.

This is also not only a build artifact repo. It is a deployment method: a specific way of making a Qwen3-class decoder fit Core ML graph constraints, ANE provisioning behavior, quantized fp16 numerics, and stateful autoregressive decode.

Deployment Envelope

The operational envelope is:

| Property | Value |
| --- | --- |
| Base model family | Qwen/Qwen3-1.7B |
| Transformer blocks | 28 |
| Shard split | default published layout is left 0-10, mid 11-19, right 20-27; builder also supports custom contiguous shard ranges |
| Hidden size | 2048 |
| Attention topology | GQA, 16 query heads / 8 KV heads |
| Default deployed context capacity | 4096 tokens |
| Context configurability | fixed at build time via SEQ_LEN; rebuild required to change it |
| Runtime batch size | fixed at 1 |
| Decode topology | stateful autoregressive decode with per-shard KV state |
| Sampling topology | separate sampler MLProgram, not host-only post-processing |

The context window matters here because this project is not shipping an abstract checkpoint. It ships concrete Core ML shard packages with a fixed sequence-state footprint. In other words, "context window" is a deployment contract, not just a config field.

What This Actually Is

The default published system is a coordinated set of Core ML MLPrograms:

| Component | Placement | Role |
| --- | --- | --- |
| io_model | CPU + GPU | Shared embedding path and LM-head/logit projection |
| decoder_shard_left | CPU + ANE | Layers 0-10 with local KV state |
| decoder_shard_mid | CPU + ANE | Layers 11-19 with local KV state |
| decoder_shard_right | CPU + ANE | Layers 20-27 plus final norm |
| sampler_model | CPU + GPU | Temperature / min-p / repetition / coherence-aware sampling |

At runtime, the host loop does:

  1. Token embedding through io_model
  2. Hidden-state handoff through left -> mid -> right decoder shards
  3. Final logits through io_model
  4. Next-token selection through sampler_model or greedy decode

This is a split-graph autoregressive system, not a single monolithic model package.
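The four-step loop above can be sketched with numpy stand-ins for the actual Core ML predict calls. Everything here is illustrative: the function names, toy sizes, and identity shards are assumptions, not the repo's API; in the real loop each function would be an MLModel prediction against io_model or a decoder shard, and the shards would also carry per-shard KV state.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 32, 8  # toy sizes; the real stack uses 151,936 / 2048

# Stand-in for io_model's shared embedding / tied LM head.
embed = rng.standard_normal((VOCAB, HIDDEN))

def io_embed(token_id):
    return embed[token_id]

def decoder_shard(hidden):
    # Placeholder for one decoder shard; the real shards mutate KV state.
    return hidden

def io_logits(hidden):
    return hidden @ embed.T  # tied embedding / logit projection

def greedy_step(token_id):
    h = io_embed(token_id)
    for shard in (decoder_shard, decoder_shard, decoder_shard):  # left -> mid -> right
        h = shard(h)
    return int(np.argmax(io_logits(h)))

next_token = greedy_step(3)
```

The point is the data flow, not the math: one embedding call, a left-to-right hidden-state handoff, one logits call, then next-token selection.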

Just as importantly, the repo is not limited to exactly one hardcoded three-way partition at build time. The generic shard builder supports:

  • canonical presets for left, mid, and right
  • custom contiguous layer ranges
  • explicit shard roles: entry, interior, and exit
  • special handling for entry-shard start constraints and exit-shard final-norm behavior

So the project has two different truths that both matter:

  • the shipped runtime path in this repo is the canonical three-shard left/mid/right deployment
  • the builder layer is more flexible than that and can emit other contiguous shard layouts

Why It Is Non-Vanilla

1. ANE-first graph partitioning

The central design problem is not only model size. It is the interaction between:

  • state count and state footprint
  • constant volume
  • graph shape and dynamic ops
  • dtype/op support on Core ML
  • model-level compute placement

The default deployment is therefore split into three shard MLPrograms, each with its own KV state set, because ANE viability depends on the shape of the graph as much as on parameter count.

That said, the builder is generalized around shard roles rather than around the literal names "left", "mid", and "right". Entry and exit shards are special, but interior partitioning is not conceptually locked to a single middle slice.

2. Reverse ring-buffer KV cache

KV cache updates are not handled with a naive append/shift scheme.

This stack uses a reversed ring layout so that active context always lives in a contiguous suffix of the sequence axis. New K/V values are written by masked blending instead of scatter-heavy updates. That keeps the Core ML graph much friendlier to ANE provisioning than a more obvious dynamic-cache implementation.
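A toy numpy sketch of the idea follows. The exact slot ordering is an assumption for illustration (here token `t` lands in slot `S-1-t`); what matters is that the write is a static-shape masked blend rather than a scatter, and that the active context is always a contiguous suffix of the sequence axis.

```python
import numpy as np

S, D = 8, 4  # toy sequence capacity and head dim
k_cache = np.zeros((S, D), dtype=np.float16)
positions = np.arange(S)

def write_step(cache, step, new_k):
    # Reversed layout: token `step` goes to slot S-1-step, so after t+1
    # tokens the active context is the contiguous suffix cache[S-1-t:].
    mask = (positions == S - 1 - step)[:, None]
    # Masked blend instead of a scatter update: a much friendlier graph
    # shape for ANE provisioning than dynamic-index writes.
    return np.where(mask, new_k.astype(cache.dtype), cache)

for t in range(3):
    k_cache = write_step(k_cache, t, np.full((D,), t + 1.0))

active = k_cache[S - 3:]  # the three written tokens, contiguous
```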

3. Mixed quantization, not one-size-fits-all quantization

The model uses two different weight-compression regimes because the embedding/head and the projection stack behave differently.

  • The shared embedding / LM head uses an OmniQuant-style blockwise weight-only path.
  • Attention and MLP projections use GS128 grouped LUT quantization with per-group scalars.
  • Bit allocation is mixed across layers and matrices rather than globally uniform.
  • Q/K are treated more conservatively than easier matrices, and MLP blocks can tolerate lower precision in selected places.

This is not "quantize everything the same way and hope". The quantization regime is part of the architecture.
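The GS128 grouped-LUT idea can be sketched in numpy. The evenly spaced 16-entry table below is illustrative only; the real payloads ship trained lookup tables, and the packing details are an assumption here. The shape of the scheme is what matters: groups of 128 weights share one scalar scale, and each weight stores only a small index into the table.

```python
import numpy as np

GROUP = 128  # GS128: groups of 128 weights share one scale
rng = np.random.default_rng(1)
w = rng.standard_normal(512).astype(np.float32)

# Illustrative 4-bit LUT (16 centroids); real packs use trained tables.
lut = np.linspace(-1.0, 1.0, 16, dtype=np.float32)

groups = w.reshape(-1, GROUP)                       # (n_groups, 128)
scales = np.abs(groups).max(axis=1, keepdims=True)  # per-group scalar
normalized = groups / scales                        # values in [-1, 1]
idx = np.abs(normalized[..., None] - lut).argmin(axis=-1)  # 4-bit codes
w_hat = (lut[idx] * scales).reshape(-1)             # dequantized weights
max_err = float(np.abs(w - w_hat).max())
```

Per-group scaling bounds the reconstruction error by the group's own magnitude, which is why grouped LUTs tolerate outlier-heavy projection matrices better than a single global scale would.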

4. Numerical stabilizers for fp16 Core ML execution

The stack adds several stabilizers that are essential under aggressive quantization:

  • safe dynamic RMSNorm
  • SLaNC pre-scales for hidden/Q/K paths
  • static RoPE tables
  • static causal mask tables
  • explicit fp16-friendly graph structure around attention and MLP paths

These pieces exist because naive fp16 Core ML execution degrades quickly under heavy quantization and long-context decode.
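For the RMSNorm piece specifically, one common fp16-safe pattern is to accumulate the mean-square in fp32 and only cast back at the end. This is an illustrative sketch under that assumption, not the repo's exact "safe dynamic RMSNorm" op sequence:

```python
import numpy as np

def safe_rmsnorm_fp16(x_fp16, gamma_fp16, eps=1e-5):
    # Accumulate the mean-square in fp32: fp16 saturates near 65504, so
    # summing squared activations in half precision degrades quickly.
    x = x_fp16.astype(np.float32)
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return ((x / rms) * gamma_fp16.astype(np.float32)).astype(np.float16)

x = (np.random.default_rng(2).standard_normal(2048) * 30).astype(np.float16)
y = safe_rmsnorm_fp16(x, np.ones(2048, dtype=np.float16))
```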

5. On-device sampler as a first-class model

Sampling is not treated as an afterthought on the host side.

The repo includes a dedicated sampler MLProgram with:

  • temperature
  • min-p pruning
  • repetition penalty
  • coherence modulation
  • noise injection for stochastic decode

That keeps the decode loop on-device and makes the sampler part of the deployment design rather than post-processing glue.
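Three of those stages compose as sketched below in numpy. This is a reference sketch of the standard temperature / repetition-penalty / min-p pipeline, not the sampler MLProgram's actual graph; coherence modulation and noise injection are omitted, and the default parameter values are assumptions.

```python
import numpy as np

def sample_step(logits, history, temperature=0.8, min_p=0.05,
                rep_penalty=1.3, rng=None):
    logits = logits.astype(np.float64).copy()
    # Repetition penalty: push already-seen tokens toward lower logits.
    for t in set(history):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    logits /= temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # min-p pruning: drop tokens below a fraction of the top probability.
    keep = probs >= min_p * probs.max()
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    rng = rng if rng is not None else np.random.default_rng(0)
    return int(rng.choice(len(probs), p=probs))

tok = sample_step(np.array([2.0, 1.9, -3.0, 0.1]), history=[0])
```

With these toy logits, the third token falls below the min-p floor and can never be sampled, while the history token is merely down-weighted rather than banned.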

Context Window and State Topology

One thing that should not be buried in the README is the sequence contract.

  • The default build emits shard packages with SEQ_LEN=4096.
  • That 4096 is the deployed Core ML context capacity, not the upstream Qwen3 theoretical maximum.
  • The decoder builders can read upstream max_position_embeddings, seq_length, or context_length, but the local stack script pins SEQ_LEN=4096 unless overridden.
  • Batch is fixed at 1, so the runtime is optimized for one stateful stream rather than multi-request batching.
  • Each shard owns its own KV state tensors for its layer range, which is one reason the context window is a real deployment-budget decision instead of an incidental config knob.

Changing the context window is therefore a rebuild operation, not a runtime flag flip. If you want 8192 or another capacity, you regenerate the Core ML packages with a different SEQ_LEN and accept the corresponding state / constant tradeoffs.

Shard Flexibility

The builder surface is more flexible than the runtime surface.

  • tools/build/build_decoder_shard.py supports canonical presets for left, mid, and right.
  • The same entrypoint also supports custom shards via --layer-start, --layer-end, --role, and final-norm selection.
  • Entry shards are special because they must start at layer 0.
  • Exit shards are special because they must end at the final decoder layer and apply final norm.
  • Interior shards can cover other contiguous layer ranges and are not conceptually restricted to one specific "mid" definition.

The current runtime and packaged release path still assume the canonical three-package deployment:

  • decoder_shard_left
  • decoder_shard_mid
  • decoder_shard_right

So the honest description is: the build system is generalized; the public runtime path in this repo is still the canonical three-shard stack.

Model Shape

This stack targets a Qwen3-1.7B-class decoder with:

  • 28 transformer blocks
  • hidden size 2048
  • grouped query attention with 16 query heads and 8 KV heads
  • RoPE base 1e6
  • vocabulary size 151,936
  • shared embedding / LM-head weights at the logical level
  • default deployed context capacity 4096 tokens
  • build-time-fixed batch size 1
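The 16/8 GQA topology means each pair of query heads shares one KV head, which is exactly what makes the per-shard KV state half the size of a full multi-head cache. A minimal numpy sketch of the head-group broadcast before the attention matmul:

```python
import numpy as np

N_Q_HEADS, N_KV_HEADS, HEAD_DIM, SEQ = 16, 8, 128, 4
group = N_Q_HEADS // N_KV_HEADS  # 2 query heads share each KV head

k = np.random.default_rng(3).standard_normal((N_KV_HEADS, SEQ, HEAD_DIM))
q = np.random.default_rng(4).standard_normal((N_Q_HEADS, SEQ, HEAD_DIM))

# Broadcast each KV head across its query-head group, then score.
k_expanded = np.repeat(k, group, axis=0)  # (16, SEQ, HEAD_DIM)
scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
```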

What is preserved:

  • decoder semantics
  • grouped-query attention structure
  • RoPE-based position encoding
  • tied embedding/logit behavior

What is intentionally not preserved:

  • upstream storage format
  • upstream KV-cache implementation
  • upstream sampler behavior
  • upstream single-graph deployment assumptions

Repository Layout

The important directories are:

  • src/qwen3_coreml/ Core runtime, builders, quantization helpers, contracts, and sampler reference logic.
  • tools/build/ Public build entrypoints for the Core ML packages.
  • tools/run/ Runtime CLI for prefill + generation with the split model stack.
  • assets/upstream/qwen3_1_7b/ Upstream Hugging Face config/tokenizer/weights.
  • assets/quantization/ Quantized embedding and projection payloads used by the builders.
  • artifacts/coreml/packages/ Generated .mlpackage outputs.
  • artifacts/coreml/mil/ Generated MIL dumps for inspection.
  • artifacts/numerics/ Generated SLaNC scale files.

This README intentionally focuses on the public build/run path. Local validation and one-off investigative tooling are not the primary public surface.

Build Inputs

The build expects three classes of inputs:

  1. Upstream Qwen3 assets

    • config.json
    • tokenizer JSON
    • Hugging Face safetensors
  2. Quantized payloads

    • OmniQuant embedding/LM-head weights and scales
    • GS128 LUT projection packs and metadata
  3. Local build output area

    • artifacts/ for scales, MIL dumps, temp packages, and final Core ML bundles

The decoder build path now supports either:

  • a consolidated model.safetensors
  • or Hugging Face sharded weights via model.safetensors.index.json

So a local full merge is no longer required before building.
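For orientation, a Hugging Face model.safetensors.index.json is a JSON file whose "weight_map" maps each parameter name to the shard file that stores it. The minimal sketch below only shows how such an index can be grouped so each shard file is opened once; the repo's builders handle this internally, and the parameter names here are synthetic examples.

```python
import json

# Synthetic example of the model.safetensors.index.json shape.
index_json = json.dumps({
    "metadata": {"total_size": 123},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    },
})

def params_by_shard(index_text):
    # Invert the weight_map: shard file -> list of parameter names.
    weight_map = json.loads(index_text)["weight_map"]
    shards = {}
    for name, fname in weight_map.items():
        shards.setdefault(fname, []).append(name)
    return shards

shards = params_by_shard(index_json)
```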

Build Outputs

The normal build emits:

  • artifacts/coreml/packages/io_model.mlpackage
  • artifacts/coreml/packages/decoder_shard_left.mlpackage
  • artifacts/coreml/packages/decoder_shard_mid.mlpackage
  • artifacts/coreml/packages/decoder_shard_right.mlpackage
  • artifacts/coreml/packages/sampler_model.mlpackage
  • artifacts/numerics/slanc_scales.npy
  • MIL dumps under artifacts/coreml/mil/

Build

The simplest build path is the stack script:

```bash
bash tools/build/build_coreml_stack.sh
```

In practice you will usually run:

```bash
PYTHON_BIN=.venv/bin/python bash tools/build/build_coreml_stack.sh
```

Key environment variables:

  • CONFIG
  • WEIGHTS
  • WEIGHTS_INDEX
  • WEIGHTS_GLOB
  • LUT_DIR
  • OMNIQ_DIR
  • SCALES
  • OUT_DIR
  • MIL_DIR
  • TMP_DIR
  • SEQ_LEN
  • PALETTIZE_MASKS

Default stack behavior:

  • computes SLaNC scales
  • builds io_model
  • builds the canonical left / mid / right decoder shards
  • palettizes KV/mask constants in shard packages
  • builds the sampler model

By default the stack builds with SEQ_LEN=4096. That is the deployed Core ML context capacity for the generated shard packages, not the upstream Qwen3 theoretical max context.

If you want a different shard partition, use the generic shard builder directly instead of the convenience stack script.

Run

Example:

```bash
python tools/run/run_inference.py \
  --models-dir artifacts/coreml/packages \
  --tokenizer assets/upstream/qwen3_1_7b/tokenizer.json \
  --prompt "What is RSA?" \
  --max-new-tokens 80 \
  --decode-mode greedy \
  --stream-mode word \
  --stats
```

Useful runtime notes:

  • decoder shards default to CPU_AND_NE
  • io_model and sampler_model default to CPU_AND_GPU
  • no visible output appears until prompt prefill completes
  • --prefill-progress is useful for long prompts
  • batch size is fixed at 1
  • generation can expose reasoning tokens such as <think> if the model emits them

The first ANE-oriented run may spend a long time in Core ML / ANE compilation before token output starts. That is expected behavior for this stack and should not be confused with a dead process.

Perplexity Parity Results

This project is not only architecturally unusual; it has also been checked against the upstream Hugging Face safetensors baseline on streamed wikitext-103-raw-v1 windows.

The comparison setup below used:

  • upstream Hugging Face safetensors as the Torch baseline
  • the split Core ML shard stack on the ANE-oriented default path
  • streamed, contiguous windows from wikitext-103-raw-v1
  • identical tokenizer and prompt-token stream on both sides

| Streamed sample | Torch PPL | Split Core ML PPL | Delta |
| --- | --- | --- | --- |
| 24 x 256 tokens | 30.6918 | 30.7140 | +0.0222 |
| 24 x 1024 tokens | 17.8666 | 17.3794 | -0.4873 |
| 24 x 2048 tokens | 16.8072 | 16.2345 | -0.5727 |
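Perplexity here is the usual exp of the mean per-token negative log-likelihood; the harness details (streaming, windowing) are summarized above, and this is only a minimal numpy sketch of the metric itself over one window's logits:

```python
import numpy as np

def perplexity(logits, targets):
    # PPL = exp(mean negative log-likelihood of the target tokens),
    # computed with a numerically stable log-softmax.
    logits = logits.astype(np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

# Sanity check: uniform logits over a 16-token vocab give PPL == 16.
ppl = perplexity(np.zeros((10, 16)), np.zeros(10, dtype=int))
```

Running both the Torch baseline and the Core ML stack through the same metric on the same token stream is what makes the deltas in the table directly comparable.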

Benchmark artifacts:

  • artifacts/ppl_compare_wikitext103raw_stream_24w_256.json
  • artifacts/ppl_compare_wikitext103raw_stream_24w_1024.json
  • artifacts/ppl_compare_wikitext103raw_stream_24w_2048.json

These numbers matter because they show that the split ANE-targeted runtime is not merely "working" in the sense of producing tokens. It stays close to the upstream Torch baseline under streamed long-window evaluation, despite graph partitioning, custom KV-state topology, mixed quantization, and fp16 Core ML execution.

What This Repository Is Not

It is not:

  • a plain Hugging Face Transformers checkpoint repo
  • a single exported Core ML graph
  • a custom ANE kernel project
  • a generic "quantized Qwen" dump with minimal glue

The value of the project is in the interaction between graph partitioning, state engineering, quantization, numerical stabilization, and sampler design.

If You Publish This On Hugging Face

The natural release payload is:

  • the five generated .mlpackage bundles, usually zipped for distribution
  • tokenizer.json
  • config.json
  • generation_config.json
  • a model card explaining the split runtime and non-vanilla methods
  • the appropriate upstream license / notice files if upstream assets are redistributed

If the goal is a clean HF model repo, do not upload local experiment clutter, local virtual environments, temporary build junk, or unrelated validation workspaces. Publish the runnable Core ML surface, not the entire machine state that produced it.

License

The repository code and build logic are under Apache 2.0. See LICENSE.

Upstream Qwen assets remain under their own license terms. If you redistribute upstream weights, tokenizer files, or config files, include the upstream license / notice material and follow the upstream redistribution terms.
