## Apple Neural Engine Optimized

This model is converted from Google's Gemma3 4B QAT (Quantization-Aware Training) checkpoint for native execution on the Apple Neural Engine (ANE).

### FP16 Scaling

ANE requires FP16 precision, but Gemma3's BF16-trained weights produce intermediate activations that overflow FP16's ±65,504 range, causing NaN/inf failures. We solve this with **weight scaling (α=0.1875)**:
- Embeddings pre-scaled by 0.1875 at conversion time
- LM head compensates with inverse scaling
- Zero runtime overhead
- Preserves 100% token match with original BF16 model
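The scheme above can be sketched numerically. The shapes and the bare embedding-to-head path below are illustrative assumptions, not ANEMLL's actual conversion code:

```python
import numpy as np

ALPHA = 0.1875  # scaling factor from the section above

# Motivation: BF16 reaches values far beyond FP16's maximum of 65,504.
big = np.float32(70000.0)
assert np.isinf(np.float16(big))             # overflows FP16
assert np.isfinite(np.float16(big * ALPHA))  # scaled value fits

# Toy weights (shapes are illustrative).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 4)).astype(np.float32)      # embedding table
lm_head = rng.normal(size=(4, 8)).astype(np.float32)  # output projection

# Conversion time: pre-scale the embeddings, fold the inverse into the LM head.
emb_fp16 = (emb * ALPHA).astype(np.float16)
head_fp16 = (lm_head / ALPHA).astype(np.float16)

# Runtime: no extra ops -- the scale cancels along the embedding->head path.
logits_ref = emb[3] @ lm_head
logits_fp16 = emb_fp16[3].astype(np.float32) @ head_fp16.astype(np.float32)
assert np.allclose(logits_ref, logits_fp16, rtol=2e-2, atol=2e-2)
```

Because the inverse factor is folded into the stored LM-head weights at conversion time, no scaling multiply runs at inference, which is why the overhead is zero.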

### Quantization

Additional LUT (Lookup Table) quantization is applied to reduce model size:
- **FFN layers**: 4-bit LUT with per-channel group size 4
- **LM head**: 6-bit LUT with per-channel group size 4
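As an illustration of the LUT idea, here is a toy per-group lookup-table quantizer. The uniform level placement and NumPy shapes are simplifying assumptions for brevity; production palettization tools typically fit the table with k-means rather than evenly spaced levels:

```python
import numpy as np

def lut_quantize(w: np.ndarray, bits: int = 4, group: int = 4):
    """Toy LUT quantizer: one 2**bits-entry table per group of output channels.

    Uniform level placement is a simplification of real LUT/palettization,
    which usually fits the table (e.g. with k-means).
    """
    levels = 2 ** bits
    deq = np.empty_like(w, dtype=np.float32)
    idx = np.empty(w.shape, dtype=np.uint8)  # stored indices: `bits` per weight
    tables = []
    for g in range(0, w.shape[0], group):
        block = w[g:g + group]
        lut = np.linspace(block.min(), block.max(), levels, dtype=np.float32)
        i = np.abs(block[..., None] - lut).argmin(axis=-1)  # nearest entry
        idx[g:g + group] = i
        deq[g:g + group] = lut[i]
        tables.append(lut)
    return deq, idx, tables

w = np.random.default_rng(1).normal(size=(8, 6)).astype(np.float32)
deq4, idx4, tabs4 = lut_quantize(w, bits=4, group=4)  # FFN-style: 4-bit LUT
deq6, idx6, tabs6 = lut_quantize(w, bits=6, group=4)  # LM-head-style: 6-bit LUT
```

A 4-bit table has only 16 entries per group while a 6-bit table has 64, which is consistent with spending more bits on the quality-sensitive LM head.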

### Quantization Results

| Configuration | KL Divergence | Correlation | Match Rate | Notes |
|---------------|---------------|-------------|------------|-------|
| FP16 baseline (no LUT) | 0.0006 | 0.995 | 99.86% | Best quality |
| **FFN LUT4,4 + LM LUT6,4** | **0.196** | **0.959** | **90%** | ***This model*** |
| FFN LUT4,8 only | 0.284 | 0.971 | 87% | Larger size |
| FFN LUT4,8 + LM LUT6,4 | 0.279 | 0.970 | 86% | - |

### Metric Guidelines

| Metric | Healthy | Concerning |
|--------|---------|------------|
| KL Divergence | < 0.3 | > 0.5 |
| Correlation | > 0.95 | < 0.90 |
| Match Rate | > 85% | < 75% |
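For concreteness, metrics of this kind can be computed along the following lines. The exact definitions behind the numbers above (per-token KL of reference vs. quantized next-token distributions, Pearson correlation of logits, top-1 match rate) are my assumptions, not ANEMLL's published evaluation code:

```python
import numpy as np

def eval_metrics(ref_logits: np.ndarray, test_logits: np.ndarray):
    """Compare two [tokens, vocab] logit arrays.

    Returns (mean KL divergence, Pearson correlation, top-1 match rate).
    Definitions are assumed and may differ from the project's evaluation.
    """
    def softmax(x):
        z = x - x.max(axis=-1, keepdims=True)  # stabilize exp
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    p, q = softmax(ref_logits), softmax(test_logits)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()
    corr = np.corrcoef(ref_logits.ravel(), test_logits.ravel())[0, 1]
    match = (ref_logits.argmax(-1) == test_logits.argmax(-1)).mean()
    return kl, corr, match

rng = np.random.default_rng(2)
ref = rng.normal(size=(5, 10))
# Identical logits give the ideal scores; a perturbation degrades them.
kl0, corr0, match0 = eval_metrics(ref, ref.copy())
kl1, corr1, match1 = eval_metrics(ref, ref + 0.01 * rng.normal(size=ref.shape))
```

A model whose metrics drift into the "Concerning" column is likely producing visibly different generations from the reference.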

### Reference

- HF Model: `google/gemma-3-4b-it-qat-int4-unquantized`
- Scaling: α=0.1875 (FP16 overflow prevention)
- Context: 4096 tokens
- Sliding Window: 1024
**ANEMLL** (pronounced like "animal") is an open-source project focused on accelerating the porting of Large Language Models (LLMs) to tensor processors, starting with the Apple Neural Engine (ANE).
The goal is to provide a fully open-source pipeline from model conversion to inference for common LLM architectures running on ANE.