anemll committed
Commit 0449180 · verified · 1 Parent(s): f305d12

Update README.md

Files changed (1): README.md +13 -12

README.md CHANGED
```diff
@@ -13,23 +13,24 @@ tags:
 ---
 # ANEMLL
 
-QAT version quantized for Apple Neural Engine.
-
-Gemma3 models trained with BF16 can produce residual-stream activations that exceed FP16's representable range (±65,504). FP16 inference is a hard requirement for ANE. This causes:
-- Overflow to `inf` in FP16 computation
-- NaN propagation through subsequent layers
-- Complete model failure on ANE (which uses FP16)
-
-## Model Quality Benchmarks
-
-### FP16 Scaling for ANE Compatibility
-
-Gemma3 4B QAT models produce activations that exceed FP16 range (±65,504) during inference. We apply **weight scaling (α=0.1875)** to prevent overflow:
-
-- Embedding weights scaled by α=0.1875 (3/16)
-- LM head logits divided by α to restore original scale
-- Zero runtime overhead - transformation applied at conversion time
-- 100% token match with BF16 reference
-
+## Apple Neural Engine Optimized
+
+This model is converted from Google's Gemma3 4B QAT (Quantization-Aware Training) model for native execution on the Apple Neural Engine (ANE).
+
+### FP16 Scaling
+
+ANE requires FP16 precision, but Gemma3's BF16-trained weights produce intermediate activations that overflow FP16's ±65,504 range, causing NaN/inf failures. We solve this with **weight scaling (α=0.1875)**:
+
+- Embeddings pre-scaled by 0.1875 at conversion time
+- LM head compensates with inverse scaling
+- Zero runtime overhead
+- Preserves 100% token match with the original BF16 model
+
+### Quantization
+
+Additional LUT (lookup table) quantization is applied to reduce model size:
+- **FFN layers**: 4-bit LUT with per-channel group size 4
+- **LM head**: 6-bit LUT with per-channel group size 4
+
 ### Quantization Results
```
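The weight-scaling trick described in the updated README can be sketched in a few lines of NumPy. This is an illustrative toy, not ANEMLL's actual conversion code: the function names and the example tensor are assumptions; only the factor α = 0.1875 comes from the README.

```python
import numpy as np

ALPHA = 0.1875  # 3/16, the scaling factor from the README

def scale_for_fp16(embed_weight: np.ndarray, alpha: float = ALPHA) -> np.ndarray:
    """Pre-scale embedding weights at conversion time so downstream
    activations stay inside FP16's representable range (±65,504)."""
    return (embed_weight * alpha).astype(np.float16)

def unscale_logits(logits: np.ndarray, alpha: float = ALPHA) -> np.ndarray:
    """Invert the scaling at the LM head so the output logits match
    the original, unscaled model."""
    return logits / alpha

# Toy demonstration: a value above 65,504 overflows a plain FP16 cast,
# but survives once pre-scaled by alpha (70000 * 0.1875 = 13125).
w = np.array([70000.0], dtype=np.float32)
assert np.isinf(w.astype(np.float16)).any()      # plain cast overflows to inf
assert np.isfinite(scale_for_fp16(w)).all()      # scaled cast stays finite
```

One convenient property of α = 3/16: it is exactly representable in binary floating point (0.0011₂), so the scaling step itself introduces no additional rounding error.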
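The LUT quantization in the new "Quantization" section can also be illustrated with a toy sketch. Assumptions to flag: this is not ANEMLL's converter; "per-channel group size 4" is interpreted here as groups of four channels sharing one table, and real converters typically choose LUT entries with k-means rather than the uniform levels used below for brevity.

```python
import numpy as np

def lut_quantize(w: np.ndarray, bits: int = 4, group_size: int = 4):
    """Toy LUT quantization: each group of `group_size` channels shares a
    2**bits-entry lookup table; weights are stored as indices into it.
    Uniform levels are used here; real converters usually use k-means."""
    levels = 2 ** bits               # 16 entries for 4-bit, 64 for 6-bit
    rows, _ = w.shape
    assert rows % group_size == 0
    idx = np.empty(w.shape, dtype=np.uint8)
    luts = []
    for g in range(0, rows, group_size):
        block = w[g:g + group_size]
        lut = np.linspace(block.min(), block.max(), levels)
        # nearest-entry assignment for every weight in the group
        idx[g:g + group_size] = np.abs(block[..., None] - lut).argmin(axis=-1)
        luts.append(lut)
    return idx, np.stack(luts)

def lut_dequantize(idx: np.ndarray, luts: np.ndarray, group_size: int = 4):
    """Reconstruct approximate weights by looking indices up in each group's table."""
    out = np.empty(idx.shape, dtype=luts.dtype)
    for i, g in enumerate(range(0, idx.shape[0], group_size)):
        out[g:g + group_size] = luts[i][idx[g:g + group_size]]
    return out
```

The storage win is what motivates the scheme: each weight shrinks to a 4-bit (FFN) or 6-bit (LM head) index, plus a small shared table per group.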