tags:
- Apple Neural Engine
- DeepHermes
---

## Model Quality Benchmarks

### FP16 Scaling for ANE Compatibility

Gemma3 4B QAT models produce activations that exceed the FP16 range (±65,504) during inference. We apply **weight scaling (α=0.1875)** to prevent overflow, illustrated in the sketch after this list:

- Embedding weights are scaled by α=0.1875 (3/16)
- LM head logits are divided by α to restore the original scale
- Zero runtime overhead: the transformation is applied at conversion time
- 100% token match with the BF16 reference
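Here is a minimal PyTorch sketch of the idea, assuming a Hugging Face-style model; `apply_fp16_scaling` and the constant handling are illustrative, not ANEMLL's actual conversion code:

```python
import torch

ALPHA = 0.1875  # 3/16; keeps peak activations inside the FP16 range (±65,504)

def apply_fp16_scaling(model, alpha=ALPHA):
    """Fold the scale factor into the embedding weights at conversion time.

    Because alpha is baked into the weights, the forward pass itself is
    unchanged (zero runtime overhead); only the final logits must be
    divided by alpha to restore their original scale.
    """
    with torch.no_grad():
        model.get_input_embeddings().weight.mul_(alpha)
    return model

# At inference time, after the scaled model produces logits:
#   logits = scaled_logits / ALPHA
```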
### Quantization Results

| Configuration | KL Divergence | Correlation | Match Rate | Notes |
|--------------|---------------|-------------|------------|-------|
| FP16 baseline (no LUT) | 0.0006 | 0.995 | 99.86% | Best quality |
| **FFN LUT4,4 + LM LUT6,4** | **0.196** | **0.959** | **90%** | ***This model*** |
| FFN LUT4,8 only | 0.284 | 0.971 | 87% | Larger size |
| FFN LUT4,8 + LM LUT6,4 | 0.279 | 0.970 | 86% | - |
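The LUT4/LUT6 prefixes denote 4-bit and 6-bit lookup-table (palettization) weight compression. As a rough illustration of how such palettization can be applied to a converted Core ML model with coremltools (the file names and settings here are hypothetical, not ANEMLL's pipeline):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

# Hypothetical path to one converted FFN chunk
mlmodel = ct.models.MLModel("ffn_chunk.mlpackage")

# 4-bit k-means lookup table applied to all eligible ops
config = OptimizationConfig(
    global_config=OpPalettizerConfig(mode="kmeans", nbits=4)
)
compressed = palettize_weights(mlmodel, config)
compressed.save("ffn_chunk_lut4.mlpackage")
```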
### Metric Guidelines

| Metric | Healthy | Concerning |
|--------|---------|------------|
| KL Divergence | < 0.3 | > 0.5 |
| Correlation | > 0.95 | < 0.90 |
| Match Rate | > 85% | < 75% |
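These metrics can be computed from per-token logits of the reference and quantized models. A minimal NumPy sketch (a hypothetical helper, not ANEMLL's evaluation tooling):

```python
import numpy as np

def quality_metrics(ref_logits, quant_logits):
    """Compare per-token logits from the reference and quantized models.

    Both arguments: float arrays of shape (num_tokens, vocab_size).
    Returns (kl_divergence, correlation, match_rate) as used above.
    """
    def softmax(x):
        # Subtract the row max for numerical stability
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(ref_logits.astype(np.float64))    # reference distribution P
    q = softmax(quant_logits.astype(np.float64))  # quantized distribution Q

    # Mean KL(P || Q) over tokens; epsilon guards against log(0)
    kl = float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

    # Pearson correlation between the flattened logit tensors
    corr = float(np.corrcoef(ref_logits.ravel(), quant_logits.ravel())[0, 1])

    # Fraction of positions where the greedy (argmax) token agrees
    match = float(np.mean(ref_logits.argmax(axis=-1) == quant_logits.argmax(axis=-1)))

    return kl, corr, match
```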
### Reference

- HF Model: `google/gemma-3-4b-it-qat-int4-unquantized`
- Scaling: α=0.1875 (FP16 overflow prevention)
- Context: 4096 tokens
- Sliding Window: 1024
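The α value exists purely to keep activations inside FP16's representable range; a quick NumPy check with an illustrative activation magnitude:

```python
import numpy as np

ALPHA = 0.1875  # 3/16

print(np.finfo(np.float16).max)  # 65504.0 -> FP16 ceiling
x = np.float32(80_000.0)         # illustrative activation beyond FP16 range
print(np.float16(x))             # inf     -> overflow without scaling
print(np.float16(x * ALPHA))     # 15000.0 -> representable after scaling
```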
# ANEMLL

**ANEMLL** (pronounced like "animal") is an open-source project focused on accelerating the porting of Large Language Models (LLMs) to tensor processors, starting with the Apple Neural Engine (ANE).