Update README.md

README.md

@@ -13,23 +13,24 @@ tags:
---
# ANEMLL

## Apple Neural Engine Optimized

This model is converted from Google's Gemma3 4B QAT (Quantization-Aware Training) model for native execution on the Apple Neural Engine (ANE).
### FP16 Scaling

ANE requires FP16 precision, but Gemma3's BF16-trained weights produce intermediate activations that exceed FP16's ±65,504 range: values overflow to `inf`, NaNs propagate through subsequent layers, and the model fails completely on ANE. We solve this with **weight scaling (α=0.1875)**, sketched in the code example after this list:

- Embeddings pre-scaled by 0.1875 at conversion time
- LM head compensates with inverse scaling
- Zero runtime overhead
- Preserves 100% token match with the original BF16 model
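As a minimal illustration of this scheme (not ANEMLL's actual conversion code; the function name, tensor shapes, and names are made up), the scale can be folded into two weight matrices once at conversion time:

```python
import torch

ALPHA = 0.1875  # 3/16, exactly representable in FP16

def prescale_for_fp16(embedding: torch.Tensor, lm_head: torch.Tensor):
    """Fold the scale into the weights offline: shrink the embedding table so
    downstream activations stay inside FP16's +/-65,504 range, and enlarge the
    LM head by the inverse factor so logit magnitudes are restored."""
    return embedding * ALPHA, lm_head / ALPHA

# Toy usage with made-up sizes; because the rewrite happens at conversion time,
# inference runs exactly the same ops as before (zero runtime overhead).
embedding = torch.randn(1000, 64)   # (vocab_size, hidden_size), illustrative only
lm_head = torch.randn(1000, 64)
embedding_scaled, lm_head_scaled = prescale_for_fp16(embedding, lm_head)
```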
### Quantization

Additional LUT (Lookup Table) quantization is applied to reduce model size (see the sketch after this list):

- **FFN layers**: 4-bit LUT with per-channel group size 4
- **LM head**: 6-bit LUT with per-channel group size 4
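For intuition, here is a toy palettization sketch, not the real converter: it groups output channels (the grouping axis is an assumption) and builds each lookup table from quantiles, whereas production tools typically use k-means. The point it illustrates is that each group of 4 channels shares a 16-entry table and every weight is stored as a 4-bit index into it:

```python
import torch

def lut_quantize(weight: torch.Tensor, nbits: int = 4, group_size: int = 4):
    """Toy LUT quantizer: each group of `group_size` output channels shares a
    2**nbits-entry lookup table, and every weight becomes an index into it."""
    levels = 2 ** nbits
    indices = torch.empty_like(weight, dtype=torch.uint8)
    tables = []
    for start in range(0, weight.shape[0], group_size):
        block = weight[start:start + group_size]            # (<=group_size, in_features)
        lut = torch.quantile(block.flatten(), torch.linspace(0, 1, levels))
        idx = (block.flatten()[:, None] - lut[None, :]).abs().argmin(dim=1)
        indices[start:start + group_size] = idx.view(block.shape).to(torch.uint8)
        tables.append(lut)
    return indices, torch.stack(tables)

def lut_dequantize(indices: torch.Tensor, tables: torch.Tensor, group_size: int = 4):
    """Rebuild a dense weight by looking every index up in its group's table."""
    out = torch.empty(indices.shape, dtype=tables.dtype)
    for g, start in enumerate(range(0, indices.shape[0], group_size)):
        out[start:start + group_size] = tables[g][indices[start:start + group_size].long()]
    return out

w = torch.randn(16, 32)                              # made-up FFN weight
idx, luts = lut_quantize(w, nbits=4, group_size=4)   # 4-bit indices + 16-entry LUT per group
w_hat = lut_dequantize(idx, luts, group_size=4)      # at most 16 distinct values per group
```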
### Quantization Results