BitCPM-CANN-1B-ATLAS (v2.10.0)

This repository contains a TQ1-quantized build of the official openbmb/BitCPM-CANN-1B model for the ATLAS Engine ecosystem, designed for native, low-latency CPU inference with no GPU required.

Packed using the unified pack_to_atlas.py toolchain (v2.10.0) with BF16 weight scale correction.


Engine Specifications

| Property | Value |
|---|---|
| Format | ATLAS Binary (.atlas), format_version=2 |
| Quantization | TQ1.0: Ternary Weight Packing (Base-3, ~1.58 bits/weight) |
| Target | Native CPU: Intel AVX2 (Haswell 2013+), no GPU needed |
| File Size | 0.83 GB |
| Inference Speed | 9.7 tok/s (hybrid) |
| Description | 28 layers, 2048 hidden, 6144 intermediate; CANN balanced |

Architecture

| Component | Detail |
|---|---|
| Base Model | openbmb/BitCPM-CANN-1B |
| Architecture | minicpm |
| Layers | 28 |
| Hidden Size | 2048 |
| Intermediate Size | 6144 |
| Attention Heads | 16 (GQA, 2 KV heads) |
| Head Dim | 128 |
| RoPE Theta | 10000.0 |
| Vocabulary | 73448 |
| Context Window | 32768 (LongRoPE, NTK-scalable) |
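
The figures in the table can be cross-checked with a little arithmetic. The sketch below confirms that heads times head dim equals the hidden size, derives the GQA group size, and estimates KV-cache memory. The 2-byte (fp16) cache element size is an assumption made for illustration only; the cache precision ATLAS actually uses is not stated here.

```python
# Values copied from the Architecture table.
LAYERS = 28
HIDDEN = 2048
HEADS = 16
KV_HEADS = 2
HEAD_DIM = 128
CONTEXT = 32768

# ASSUMPTION: 2 bytes per cached element (fp16). Not confirmed for ATLAS.
BYTES_PER_ELEM = 2


def kv_cache_bytes(tokens: int) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # The leading 2 accounts for storing both K and V in every layer.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return per_token * tokens


queries_per_kv_head = HEADS // KV_HEADS

print("heads * head_dim == hidden:", HEADS * HEAD_DIM == HIDDEN)
print("query heads per KV head:", queries_per_kv_head)
print("KV cache per token (KiB):", kv_cache_bytes(1) / 1024)
print("KV cache at full context (MiB):", kv_cache_bytes(CONTEXT) / 2**20)
```

Under that assumption the cache costs 28 KiB per token, or 896 MiB at the full 32768-token window. With 16 KV heads instead of 2 it would be eight times larger, which is the memory saving GQA buys.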

Requires ATLAS Engine v2.10.4+. Older engine versions lack the scale_emb and scale_depth support needed for MiniCPM DeepNorm residual scaling.

Verification

During pre-release evaluation (v2.10.0), this quantized derivative produced correct, coherent output:

  • T=0 (argmax): "The capital of France is Paris." (correct deterministic output)
  • T=0.7 (sampling): Coherent structured generation with sensible continuation

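Note on scale mathematics: the legacy dequantization path divides by the scale factor rather than multiplying. When the resulting factor is a single positive constant shared by all logits in an output row, the ranking of the logits is unchanged, so greedy (T=0) decoding gives the same tokens. Under sampling, a uniform rescale of the logits acts like a change in effective temperature, so the softmax distribution keeps its ordering but is not numerically identical.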
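
The effect of a uniform logit rescale can be checked numerically. The sketch below uses made-up logits and a hypothetical factor of 0.25 (neither comes from the ATLAS code) and compares softmax outputs. A positive scale preserves the ranking, while the probabilities themselves are preserved exactly only under an additive shift.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 1.0, 0.5]              # illustrative values only
scaled = [x * 0.25 for x in logits]   # hypothetical uniform positive factor
shifted = [x + 3.0 for x in logits]   # uniform additive offset, for contrast

p = softmax(logits)
p_scaled = softmax(scaled)
p_shifted = softmax(shifted)

print("same argmax after scaling:", argmax(p) == argmax(p_scaled))
print("original: ", [round(v, 4) for v in p])
print("scaled:   ", [round(v, 4) for v in p_scaled])
print("shifted:  ", [round(v, 4) for v in p_shifted])
```

The scaled distribution is flatter than the original, which is the same effect as sampling at a higher temperature.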


Prompt Template

Use the standard chat template below. Prompts that deviate from it can degrade generation quality:

<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant

Example Sequence

<|im_start|>user
Explain quantum computing in one sentence.<|im_end|>
<|im_start|>assistant
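
A small helper can keep prompts consistent with the template. This is a minimal sketch: `build_prompt` is a hypothetical name, and the trailing newline after the assistant tag is an assumption based on the blank line in the template. Whether the ATLAS Python wrapper or server applies the template on its own is not stated here, so check before wrapping prompts twice.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user message in the chat template shown above.

    ASSUMPTION: the assistant tag is followed by a newline so the model
    starts generating on a fresh line.
    """
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_prompt("Explain quantum computing in one sentence."))
```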

Usage

Python

# In a shell, clone the engine repository first:
#   git clone https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.git
from atlas_infer import AtlasModel

model = AtlasModel("BitCPM-CANN-1B-tq1.atlas")
output = model.generate_c(
    "What is the capital of France?",
    max_new_tokens=100,
    temperature=0.7,
    top_k=40,
)
print(output)

C++ CLI (standalone, no Python required)

atlas --model BitCPM-CANN-1B-tq1.atlas --prompt "What is the capital of France?" --max-tokens 100

SSE Web Server

# Terminal 1: start the server
python atlas_server.py --model BitCPM-CANN-1B-tq1.atlas --port 8080

# Terminal 2: send a request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "max_tokens": 100}'
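
The same request can be sent from Python with only the standard library. The sketch below reuses the endpoint and JSON fields from the curl example; the helper names are hypothetical, and because the schema of each streamed event is not documented here, the client simply prints the raw text of every SSE `data:` line.

```python
import json
import urllib.request


def build_request(prompt, max_tokens=100, base_url="http://localhost:8080"):
    """Build the POST request used in the curl example."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_sse_data(lines):
    """Yield the payload of each 'data:' line from an iterable of byte lines."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            payload = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            if payload.startswith(" "):
                payload = payload[1:]
            yield payload


def stream(prompt, **kwargs):
    """Send the request and print each event payload as it arrives.

    ASSUMPTION: the server answers with a text/event-stream body.
    """
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        for payload in iter_sse_data(resp):
            print(payload)
```

With the server running, `stream("What is the capital of France?", max_tokens=100)` prints the events one per line.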

What is ATLAS?

ATLAS is a CPU inference engine for BitNet b1.58 ternary-quantized models. It repacks HuggingFace safetensors into the TQ1.0 format (5 trits per byte, Base-3 encoding, ~1.58 bits/weight) and runs inference through a C++ DLL with a Python wrapper.
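
To illustrate how five ternary weights fit in one byte, here is a minimal Base-3 packing sketch. It is not the ATLAS implementation: the digit order, the zero padding of the last group, and the function names are all assumptions chosen for clarity.

```python
def pack_trits(trits):
    """Pack ternary weights (-1, 0, +1) into bytes, 5 trits per byte."""
    out = bytearray()
    for start in range(0, len(trits), 5):
        group = list(trits[start:start + 5])
        group += [0] * (5 - len(group))       # pad the final group with zero weights
        value = 0
        for t in reversed(group):
            if t not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {t!r}")
            value = value * 3 + (t + 1)       # map -1/0/+1 to base-3 digits 0/1/2
        out.append(value)                     # at most 3**5 - 1 = 242, fits in a byte
    return bytes(out)


def unpack_trits(data, count):
    """Recover the first `count` ternary weights from packed bytes."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)        # least significant digit first
            byte //= 3
    return trits[:count]


weights = [-1, 0, 1, 1, -1, 0, 0, 1]
packed = pack_trits(weights)
print(len(packed), "bytes for", len(weights), "weights")
print(unpack_trits(packed, len(weights)) == weights)
```

Five trits use 243 of the 256 values a byte can hold, so storage comes to 8 / 5 = 1.6 bits per weight, slightly above the log2(3) ≈ 1.585-bit ideal that the ~1.58 figure refers to.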

| Feature | Description |
|---|---|
| No GPU required | Runs on any x86-64 CPU with AVX2 (Intel Haswell 2013+, AMD Excavator 2015+) |
| Hybrid matmul | FFN tensors in int8, QKV/O in TQ1-packed, per-tensor dispatch |
| int4 FFN mode | Halves FFN memory bandwidth for 18-26% speedup (7B/10B) |
| f32 bypass | Auto-enabled for small models (≤1B) and SubLN architectures |
| Ring buffer KV cache | Extended context via NTK-aware RoPE scaling |
| Standalone C++ CLI | No Python or PyTorch required at runtime |
| SSE web server | FastAPI-based /v1/chat/completions with prompt caching |

License

This is a quantized derivative of the BitCPM-CANN model by OpenBMB, which was originally released under Apache 2.0.

The ATLAS engine itself is Apache 2.0 licensed.
