Update README.md

README.md

@@ -13,23 +13,24 @@ tags:
---
# ANEMLL

## Apple Neural Engine Optimized

This model is converted from Google's Gemma3 4B QAT (Quantization-Aware Training) model for native execution on the Apple Neural Engine (ANE).
### FP16 Scaling

ANE requires FP16 precision, but Gemma3's BF16-trained weights produce intermediate activations that exceed FP16's ±65,504 range: values overflow to `inf`, NaNs propagate through subsequent layers, and the model fails completely on ANE. We solve this with **weight scaling (α=0.1875)**, sketched in the code example after this list:

- Embeddings pre-scaled by 0.1875 at conversion time
- LM head compensates with inverse scaling
- Zero runtime overhead
- Preserves 100% token match with the original BF16 model
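As a minimal illustration of this scheme (not ANEMLL's actual conversion code; the function name, tensor shapes, and names are made up), the scale can be folded into two weight matrices once at conversion time:

```python
import torch

ALPHA = 0.1875  # 3/16, exactly representable in FP16

def prescale_for_fp16(embedding: torch.Tensor, lm_head: torch.Tensor):
    """Fold the scale into the weights offline: shrink the embedding table so
    downstream activations stay inside FP16's +/-65,504 range, and enlarge the
    LM head by the inverse factor so logit magnitudes are restored."""
    return embedding * ALPHA, lm_head / ALPHA

# Toy usage with made-up sizes; because the rewrite happens at conversion time,
# inference runs exactly the same ops as before (zero runtime overhead).
embedding = torch.randn(1000, 64)   # (vocab_size, hidden_size), illustrative only
lm_head = torch.randn(1000, 64)
embedding_scaled, lm_head_scaled = prescale_for_fp16(embedding, lm_head)
```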
### Quantization

Additional LUT (Lookup Table) quantization is applied to reduce model size (see the sketch after this list):

- **FFN layers**: 4-bit LUT with per-channel group size 4
- **LM head**: 6-bit LUT with per-channel group size 4
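For intuition, here is a toy palettization sketch, not the real converter: it groups output channels (the grouping axis is an assumption) and builds each lookup table from quantiles, whereas production tools typically use k-means. The point it illustrates is that each group of 4 channels shares a 16-entry table and every weight is stored as a 4-bit index into it:

```python
import torch

def lut_quantize(weight: torch.Tensor, nbits: int = 4, group_size: int = 4):
    """Toy LUT quantizer: each group of `group_size` output channels shares a
    2**nbits-entry lookup table, and every weight becomes an index into it."""
    levels = 2 ** nbits
    indices = torch.empty_like(weight, dtype=torch.uint8)
    tables = []
    for start in range(0, weight.shape[0], group_size):
        block = weight[start:start + group_size]            # (<=group_size, in_features)
        lut = torch.quantile(block.flatten(), torch.linspace(0, 1, levels))
        idx = (block.flatten()[:, None] - lut[None, :]).abs().argmin(dim=1)
        indices[start:start + group_size] = idx.view(block.shape).to(torch.uint8)
        tables.append(lut)
    return indices, torch.stack(tables)

def lut_dequantize(indices: torch.Tensor, tables: torch.Tensor, group_size: int = 4):
    """Rebuild a dense weight by looking every index up in its group's table."""
    out = torch.empty(indices.shape, dtype=tables.dtype)
    for g, start in enumerate(range(0, indices.shape[0], group_size)):
        out[start:start + group_size] = tables[g][indices[start:start + group_size].long()]
    return out

w = torch.randn(16, 32)                              # made-up FFN weight
idx, luts = lut_quantize(w, nbits=4, group_size=4)   # 4-bit indices + 16-entry LUT per group
w_hat = lut_dequantize(idx, luts, group_size=4)      # at most 16 distinct values per group
```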
### Quantization Results