RelaxingSnorlax committed on
Commit
9cd3b28
·
verified ·
1 Parent(s): d9ab8e2

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +140 -0
  2. config.json +84 -0
  3. metadata.json +15 -0
  4. model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,140 @@
---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
base_model: nvidia/Llama-4-Maverick-17B-128E-Eagle3
tags:
- speculative-decoding
- eagle3
- llama3
- llama4
- vllm
- speculators
---

# Llama4-Maverick-Eagle3-Speculators

## Model Description

**⚠️ Development Reference Model**: This model was converted as a reference for vLLM development. Once that development is complete, it can be served with:
```bash
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
```

This is a manually converted Eagle3 speculator model based on NVIDIA's Llama-4-Maverick-17B-128E-Eagle3, reformatted for compatibility with the [Speculators](https://github.com/neuralmagic/speculators) library and vLLM speculative decoding.

### Development Status
🚧 **Reference Implementation for vLLM Development**
- This model serves as a reference implementation for vLLM Eagle3 support
- It contains non-standard features (auxiliary hidden states) that require vLLM extensions
- Once vLLM development is complete, it will support direct serving

### Key Features
- **Architecture**: Eagle3 speculator with a Llama3-based draft head
- **Target Verifier**: Llama4 Maverick 17B (quantized w4a16)
- **Vocabulary Size**: 202,048 tokens (unusually large for a draft model)
- **Special Feature**: Uses auxiliary hidden states from verifier layers [1, 23, 44]

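The auxiliary-hidden-state path can be sketched as follows. This is an illustrative NumPy mock of the general Eagle3 pattern, not this repo's actual code: the layer ids and hidden size come from this model's config, but the fusion projection `W_fc` and its initialization are stand-ins for the trained weights.

```python
import numpy as np

hidden_size = 5120           # draft head hidden size (from config.json)
aux_layer_ids = [1, 23, 44]  # verifier layers whose hidden states are tapped

rng = np.random.default_rng(0)
# Stand-ins for the verifier's per-layer hidden states for one token
aux_states = [rng.standard_normal((1, hidden_size)) for _ in aux_layer_ids]

# Eagle3 concatenates the tapped states and projects them down to the
# draft head's width before running its single transformer layer
fused = np.concatenate(aux_states, axis=-1)  # shape (1, 3 * 5120)
W_fc = rng.standard_normal((3 * hidden_size, hidden_size)) * 0.02  # assumed projection
draft_input = fused @ W_fc                   # shape (1, 5120)
```

This is why the checkpoint cannot be served by a stock Eagle implementation: the verifier must expose intermediate layer outputs, not just its final hidden state.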
## Configuration Details

This model represents a unique hybrid configuration:
- **Draft Model**: Llama3-based Eagle3 head (single transformer layer)
- **Verifier Model**: `RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16`
- **Architecture Class**: `Llama4ForConditionalGeneration` for the verifier

### Non-Standard Features

This model includes several non-standard Eagle3 features preserved from the NVIDIA checkpoint:
- Auxiliary hidden states taken from verifier layers [1, 23, 44]
- Custom layer normalization configurations
- A large vocabulary matching the target model

## Usage

### With vLLM (After Development Is Complete)

```bash
# Once vLLM development is complete, serve directly:
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
```

### With Speculators Library

```python
from speculators import SpeculatorModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the speculator
speculator = SpeculatorModel.from_pretrained("nm-testing/Llama4-Maverick-Eagle3-Speculators")

# Load and attach the verifier
verifier = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    trust_remote_code=True,
)
speculator.attach_verifier(verifier)

# Tokenize a prompt and generate with speculative decoding
tokenizer = AutoTokenizer.from_pretrained(
    "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16"
)
input_ids = tokenizer("Hello, world!", return_tensors="pt").input_ids
outputs = speculator.generate(input_ids, max_length=100)
```

## Configuration Structure

The model uses the Speculators Eagle3 format with additional fields for NVIDIA-specific features. Note the `rope_type` of `llama3`, which confirms the Llama3-based draft head:

```json
{
  "speculators_model_type": "eagle3",
  "architectures": ["Eagle3Speculator"],
  "draft_vocab_size": 202048,
  "transformer_layer_config": {
    "rope_scaling": {
      "rope_type": "llama3"
    }
  },
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44],
  "use_aux_hidden_state": true
}
```

## Performance Notes

- **Vocabulary Size**: The 202K vocabulary is unusually large for a draft model and may increase memory usage
- **Auxiliary Hidden States**: May require custom Eagle3Speculator extensions for full functionality
- **Acceptance Rate**: Expected ~2-3 tokens per forward pass, based on NVIDIA benchmarks

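Two quick back-of-the-envelope numbers behind these notes. The embedding cost follows from this repo's config (202,048 × 5,120 in bf16); the acceptance rate `alpha` below is a hypothetical value chosen for illustration, not a measured one, plugged into the standard speculative-decoding expectation for k draft tokens.

```python
# Memory cost of the 202K-entry draft vocabulary in bf16 (2 bytes/param)
vocab, hidden = 202_048, 5_120
embed_gib = vocab * hidden * 2 / 2**30
print(f"{embed_gib:.2f} GiB per embedding matrix")  # ≈ 1.93 GiB

# Expected accepted tokens per verifier pass with k draft tokens and a
# per-token acceptance rate alpha: (1 - alpha^(k+1)) / (1 - alpha)
alpha, k = 0.7, 3  # alpha is hypothetical; k matches speculative_tokens in config.json
expected = (1 - alpha ** (k + 1)) / (1 - alpha)
print(f"{expected:.2f} tokens per pass")  # ≈ 2.53, consistent with the ~2-3 range
```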
## Model Weights

- **Format**: SafeTensors
- **Precision**: bfloat16
- **Size**: ~3.2 GB

## Citation

If you use this model, please cite both the original NVIDIA model and the Speculators library:

```bibtex
@misc{nvidia2025llama4maverick,
  title={Llama 4 Maverick 17B Eagle3},
  author={NVIDIA Corporation},
  year={2025},
  publisher={Hugging Face}
}

@misc{speculators2024,
  title={Speculators: A Unified Library for Speculative Decoding},
  author={Neural Magic},
  year={2024},
  url={https://github.com/neuralmagic/speculators}
}
```

## License

This model is subject to the NVIDIA Open Model License. Please review the license terms before use.

## Acknowledgments

- Original model by NVIDIA Corporation
- Conversion and formatting for Speculators/vLLM compatibility
- Based on the Eagle3 architecture with a Llama3 draft head targeting a Llama4 verifier
config.json ADDED
@@ -0,0 +1,84 @@
{
  "architectures": [
    "Eagle3Speculator"
  ],
  "speculators_model_type": "eagle3",
  "speculators_version": "0.1.0.dev42",
  "draft_vocab_size": 202048,
  "norm_before_residual": true,
  "target_hidden_size": null,
  "transformer_layer_config": {
    "model_type": "llama",
    "vocab_size": 202048,
    "hidden_size": 5120,
    "intermediate_size": 32768,
    "num_hidden_layers": 1,
    "num_attention_heads": 40,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "hidden_act": "silu",
    "max_position_embeddings": 1048576,
    "initializer_range": 0.02,
    "rms_norm_eps": 1e-05,
    "pretraining_tp": 1,
    "use_cache": true,
    "rope_theta": 500000.0,
    "rope_scaling": {
      "factor": 8.0,
      "high_freq_factor": 4.0,
      "low_freq_factor": 1.0,
      "original_max_position_embeddings": 8192,
      "rope_type": "llama3"
    },
    "attention_bias": false,
    "attention_dropout": 0.0,
    "mlp_bias": false,
    "tie_word_embeddings": false
  },
  "speculators_config": {
    "algorithm": "eagle3",
    "default_proposal_method": "greedy",
    "proposal_methods": [
      {
        "proposal_type": "greedy",
        "speculative_tokens": 3,
        "verifier_accept_k": 1,
        "accept_tolerance": 0.0
      }
    ],
    "verifier": {
      "name_or_path": "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
      "architectures": [
        "Llama4ForConditionalGeneration"
      ]
    }
  },
  "torch_dtype": "bfloat16",
  "eagle_aux_hidden_state_layer_ids": [
    1,
    23,
    44
  ],
  "use_aux_hidden_state": true,
  "use_input_layernorm_in_first_layer": true,
  "use_mtp_layernorm": false,
  "eagle_config": {
    "eagle_aux_hidden_state_layer_ids": [
      1,
      23,
      44
    ],
    "use_aux_hidden_state": true,
    "use_input_layernorm_in_first_layer": true,
    "use_last_layernorm": true,
    "use_mtp_layernorm": false
  },
  "_comment": "Eagle3 head based on Llama3 architecture targeting Llama4 Maverick verifier",
  "_conversion_notes": {
    "source": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",
    "architecture_notes": "Eagle3 head uses Llama3 rope_type, targets Llama4 verifier",
    "vocabulary_notes": "Large 202K vocabulary, same for draft and target",
    "auxiliary_layers": "Uses hidden states from verifier layers 1, 23, 44",
    "implementation_note": "May require Eagle3Speculator extensions for aux hidden states"
  }
}
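As a sanity check, the draft head's attention geometry in `transformer_layer_config` is internally consistent. The values below are copied from this config; the script itself is just an illustrative check, not part of the repo:

```python
# Relevant subset of transformer_layer_config from config.json
cfg = {
    "hidden_size": 5120,
    "num_attention_heads": 40,
    "num_key_value_heads": 8,
    "head_dim": 128,
}

# heads * head_dim must equal the model width: 40 * 128 = 5120
assert cfg["num_attention_heads"] * cfg["head_dim"] == cfg["hidden_size"]

# Grouped-query attention: number of query heads sharing each KV head
gqa_groups = cfg["num_attention_heads"] // cfg["num_key_value_heads"]
print(gqa_groups)  # 5
```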
metadata.json ADDED
@@ -0,0 +1,15 @@
{
  "conversion_tool": "create_final_eagle3_config.py",
  "source_checkpoint": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",
  "format": "speculators-eagle3",
  "architecture": "Llama3-based Eagle3 head",
  "verifier": "Llama4 Maverick",
  "notes": [
    "Eagle3 head based on Llama3 architecture (rope_type: llama3)",
    "Targets Llama4 Maverick verifier (Llama4ForConditionalGeneration)",
    "Large vocabulary of 202,048 tokens",
    "Uses auxiliary hidden states from layers 1, 23, 44",
    "NVIDIA-specific fields preserved as extra configuration",
    "May require Eagle3Speculator implementation extensions"
  ]
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b09a7b8316252795759a0ef96e0420eedba7752323a1c02eb259c57536a20c90
size 3432162640
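The byte count in this Git LFS pointer lines up with the "~3.2 GB" weight size stated in the README:

```python
size_bytes = 3_432_162_640  # "size" field from the Git LFS pointer above
size_gib = size_bytes / 2**30
print(f"{size_gib:.2f} GiB")  # ≈ 3.20 GiB, matching the README's ~3.2 GB
```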