arnomatic commited on
Commit
314afc2
·
verified ·
1 Parent(s): 206a410

Update README.md

  1. README.md +334 -269
README.md CHANGED
@@ -1,270 +1,335 @@
1
- ---
2
- base_model: ByteDance-Seed/Seed-OSS-36B-Base
3
- language:
4
- - en
5
- library_name: transformers
6
- license: apache-2.0
7
- pipeline_tag: text-generation
8
- tags:
9
- - Bytedance Seed
10
- - instruct
11
- - finetune
12
- - reasoning
13
- - hybrid-mode
14
- - chatml
15
- - function calling
16
- - tool use
17
- - json mode
18
- - structured outputs
19
- - atropos
20
- - dataforge
21
- - long context
22
- - roleplaying
23
- - chat
24
- - heretic
25
- - uncensored
26
- - decensored
27
- - abliterated
28
- widget:
29
- - example_title: Hermes 4
30
- messages:
31
- - role: system
32
- content: You are Hermes 4, a capable, neutrally-aligned assistant. Prefer concise,
33
- correct answers.
34
- - role: user
35
- content: Explain the difference between BFS and DFS to a new CS student.
36
- model-index:
37
- - name: Hermes-4.3-ByteDance-Seed-36B
38
- results: []
39
- ---
40
- # This is a decensored version of [NousResearch/Hermes-4.3-36B](https://huggingface.co/NousResearch/Hermes-4.3-36B), made using [Heretic](https://github.com/p-e-w/heretic) v1.1.0
41
-
42
- ## Abliteration parameters
43
-
44
- | Parameter | Value |
45
- | :-------- | :---: |
46
- | **direction_index** | 49.93 |
47
- | **attn.o_proj.max_weight** | 1.26 |
48
- | **attn.o_proj.max_weight_position** | 53.05 |
49
- | **attn.o_proj.min_weight** | 1.01 |
50
- | **attn.o_proj.min_weight_distance** | 26.38 |
51
- | **mlp.down_proj.max_weight** | 1.45 |
52
- | **mlp.down_proj.max_weight_position** | 38.47 |
53
- | **mlp.down_proj.min_weight** | 0.44 |
54
- | **mlp.down_proj.min_weight_distance** | 10.52 |
55
-
56
- ## Performance
57
-
58
- | Metric | This model | Original model ([NousResearch/Hermes-4.3-36B](https://huggingface.co/NousResearch/Hermes-4.3-36B)) |
59
- | :----- | :--------: | :---------------------------: |
60
- | **KL divergence** | 0.0157 | 0 *(by definition)* |
61
- | **Refusals** | 6/100 | 81/100 |
62
-
63
- -----
64
-
65
-
66
- # Hermes 4.3 - Seed 36B
67
-
68
- ![4.3 small](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/RoJnKcyjYfd5zj3PP3eRq.png)
69
-
70
- ## Model Description
71
-
72
- Hermes 4.3 36B is a frontier, hybrid-mode **reasoning** model based on ByteDance Seed 36B base, made by Nous Research that is aligned to **you**.
73
-
74
- This is our first Hermes model trained in a decentralized manner over the internet using [Psyche](https://psyche.network), read the blog post: https://nousresearch.com/introducing-hermes-4-3/
75
-
76
- Read the Hermes 4 technical report here: <a href="https://arxiv.org/abs/2508.18255">Hermes 4 Technical Report</a>
77
-
78
- Chat with Hermes in Nous Chat: https://chat.nousresearch.com
79
-
80
- Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.
81
-
82
- ## What’s new vs Hermes 3
83
-
84
- - **Post-training corpus**: Massively increased dataset size from 1M samples and 1.2B tokens to **~5M samples / ~60B tokens** blended across reasoning and non-reasoning data.
85
- - **Hybrid reasoning mode** with explicit `<think>…</think>` segments when the model decides to deliberate, and options to make your responses faster when you want.
86
- - **Reasoning** that is top quality, expressive, improves math, code, STEM, logic, and even creative writing and subjective responses.
87
- - **Schema adherence & structured outputs**: trained to produce valid JSON for given schemas and to repair malformed objects.
88
- - **Much easier to steer and align**: extreme improvements on steerability, especially on reduced refusal rates.
89
-
90
- ## Our Mission: Frontier Capabilities Aligned to You
91
-
92
- In pursuit of the mission of producing models that are open, steerable and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests the models willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models.
93
-
94
- **Hermes 4.3 36B is now SOTA across non-abliterated models on the RefusalBench Leaderboard, surpassing our previous best of 59.5% on Hermes 4 70B**
95
- ### % of Questions Answered – RefusalBench
96
- *(Average of 5 trials)*
97
- | Model | % of Questions Answered |
98
- | :--- | :--- |
99
- | **Hermes 4.3 36B Non-Reasoning** | 74.60% |
100
- | **Hermes 4.3 36B Reasoning** | 72.29% |
101
- | Hermes 4 70B Reasoning | 59.50% |
102
- | Hermes 4 405B Reasoning | 57.10% |
103
- | grok4 | 51.30% |
104
- | Hermes 4 70B | 49.07% |
105
- | Hermes 4 405B | 43.20% |
106
- | Qwen2.5 7B | 36.10% |
107
- | Qwen3 235B Reasoning | 34.30% |
108
- | DeepSeek V3 | 28.10% |
109
- | Gemini 2.5 Pro | 24.23% |
110
- | Llama 405B | 21.70% |
111
- | Gemini 2.5 Flash | 19.13% |
112
- | GPT4o | 17.67% |
113
- | Sonnet 4 | 17.00% |
114
- | GPT4-mini | 16.76% |
115
- | R1 | 16.70% |
116
- | cogito-v2-405B Reasoning| 15.40% |
117
- | Opus 4.1 | 15.38% |
118
- | Qwen3 235B | 15.30% |
119
- | cogito-v2-405B | 14.94% |
120
- | cogito-v2-405B | 12.10% |
121
- | GPT 5 | 11.34% |
122
- | gpt-oss 120B | 5.60% |
123
- | gpt-oss 20B | 4.79% |
124
-
125
- Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.
126
-
127
- ## Benchmarks (Hermes 4.3 36B)
128
-
129
- | | Hermes 4.3 36B Psyche | Hermes 4.3 36B Centralized | Hermes 4 70B Centralized |
130
- | :--- | :--- | :--- | :--- |
131
- | AIME 24 | 71.9 | 70.6 | 73.5 |
132
- | AIME 25 | 69.3 | 66.8 | 67.4 |
133
- | BBH | 86.4 | 84.7 | 87.8 |
134
- | DROP | 83.5 | 81.6 | 85.0 |
135
- | GPQA Diamond | 65.5 | 64.8 | 66.1 |
136
- | IFEval | 77.9 | 73.9 | 78.7 |
137
- | MATH-500 | 93.8 | 92.3 | 95.5 |
138
- | MMLU | 87.7 | 86.5 | 88.4 |
139
- | MMLU-Pro | 80.7 | 79.7 | 80.7 |
140
- | MuSR | 69.7 | 64.7 | 70.4 |
141
- | OBQA | 96.6 | 91.8 | 94.8 |
142
- | SimpleQA | 6.0 | 5.6 | 17.9 |
143
-
144
- ## Prompt Format
145
-
146
- Hermes 4 uses Llama-3-Chat format with role headers and special tags.
147
-
148
- **Basic chat:**
149
- ```
150
- <|start_header_id|>system<|end_header_id|>
151
-
152
- You are Hermes 4. Be concise and helpful.<|eot_id|>
153
- <|start_header_id|>user<|end_header_id|>
154
-
155
- Explain the photoelectric effect simply.<|eot_id|>
156
- <|start_header_id|>assistant<|end_header_id|>
157
- ```
158
-
159
- ### Reasoning mode
160
-
161
- Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using the following system prompt:
162
-
163
- ```
164
- You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
165
- ```
166
-
167
- Note that you can add any additional system instructions before or after this system message, and it will adjust the models policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool definition system message with the reasoning one.
168
-
169
- When the model chooses to deliberate, it emits:
170
-
171
- ```
172
- <|start_header_id|>assistant<|end_header_id|>
173
- <think>
174
- …model’s internal reasoning may appear here…
175
- </think>
176
- Final response starts here…<|eot_id|>
177
- ```
178
-
179
- Additionally, we provide a flag to keep the content inbetween the `<think> ... </think>` that you can play with by setting `keep_cots=True`
180
-
181
-
182
- ## Function Calling & Tool Use
183
-
184
- Hermes 4 supports function/tool calls *within* a single assistant turn, produced after it's reasoning:
185
-
186
- **System message (example):**
187
-
188
- ```
189
- <|start_header_id|>system<|end_header_id|>
190
- You are a function-calling AI. Tools are provided inside <tools>…</tools>.
191
- When appropriate, call a tool by emitting a <tool_call>{...}</tool_call> object.
192
- After a tool responds (as <tool_response>), continue reasoning inside <think> and produce the final answer.
193
- <tools>
194
- {"type":"function","function":{"name":"get_weather","description":"Get weather by city","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}
195
- </tools><|eot_id|>
196
- ```
197
-
198
- Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use.
199
-
200
- The model will then generate tool calls within `<tool_call> {tool_call} </tool_call>` tags, for easy parsing. The tool_call tags are also added tokens, so it makes it easy to parse while streaming! There are also automatic tool parsers built-in to VLLM and SGLang for Hermes, just set the tool parser in VLLM to `hermes` and in SGLang to `qwen25`.
201
-
202
- ## Inference Notes
203
-
204
- - **Sampling defaults that work well:** `temperature=0.6, top_p=0.95, top_k=20`.
205
- - **Template:** Use the Llama chat format for Hermes 4.3 36B, 70B, and 405B as shown above, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.
206
-
207
- ### Transformers example
208
-
209
- ```python
210
- from transformers import AutoTokenizer, AutoModelForCausalLM
211
- import torch
212
-
213
- model_id = "NousResearch/Hermes-4.3-36B"
214
-
215
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
216
- model = AutoModelForCausalLM.from_pretrained(
217
- model_id,
218
- torch_dtype=torch.float16,
219
- device_map="auto"
220
- )
221
-
222
- messages = [
223
- {"role":"system","content":"You are Hermes 4. Be concise."},
224
- {"role":"user","content":"Summarize CRISPR in 3 sentences."}
225
- ]
226
-
227
- inputs = tokenizer.apply_chat_template(
228
- messages, add_generation_prompt=True, return_tensors="pt"
229
- ).to(model.device)
230
-
231
- outputs = model.generate(
232
- **inputs, max_new_tokens=400, temperature=0.6, top_p=0.95, top_k=20, do_sample=True
233
- )
234
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
235
- ```
236
-
237
- For production serving on multi-GPU nodes, consider tensor parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.
238
-
239
- ## Inference Providers:
240
-
241
- ### Nous Portal:
242
-
243
- <a href="https://portal.nousresearch.com"><img width=256 alt="chutes logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/6YytY7N0mjCnBQvWo3qtv.png"></a>
244
-
245
- ### Chutes:
246
-
247
- <a href="https://chutes.ai/app"><img width=256 alt="chutes logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/l14AWPv6cSvaprpwK_IWY.png"></a>
248
-
249
- # Quantized / Smaller Variants
250
-
251
- Hermes 4 is available as BF16 original weights as well as BF16 as well as GGUF variants.
252
-
253
- GGUF Verions (4, 5, 6, and 8bit): https://huggingface.co/NousResearch/Hermes-4.3-36B-GGUF
254
-
255
- See the Hermes 4 collection to explore them all:
256
- https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
257
-
258
- # How to cite
259
-
260
- ```bibtex
261
- @misc{teknium2025hermes4technicalreport,
262
- title={Hermes 4 Technical Report},
263
- author={Ryan Teknium and Roger Jin and Jai Suphavadeeprasit and Dakota Mahan and Jeffrey Quesnelle and Joe Li and Chen Guang and Shannon Sands and Karan Malhotra},
264
- year={2025},
265
- eprint={2508.18255},
266
- archivePrefix={arXiv},
267
- primaryClass={cs.AI},
268
- url={https://arxiv.org/abs/2508.18255},
269
- }
270
  ```
 
1
+ ---
2
+ base_model: ByteDance-Seed/Seed-OSS-36B-Base
3
+ language:
4
+ - en
5
+ library_name: transformers
6
+ license: apache-2.0
7
+ pipeline_tag: text-generation
8
+ tags:
9
+ - Bytedance Seed
10
+ - instruct
11
+ - finetune
12
+ - reasoning
13
+ - hybrid-mode
14
+ - chatml
15
+ - function calling
16
+ - tool use
17
+ - json mode
18
+ - structured outputs
19
+ - atropos
20
+ - dataforge
21
+ - long context
22
+ - roleplaying
23
+ - chat
24
+ - heretic
25
+ - uncensored
26
+ - decensored
27
+ - abliterated
28
+ widget:
29
+ - example_title: Hermes 4
30
+ messages:
31
+ - role: system
32
+ content: You are Hermes 4, a capable, neutrally-aligned assistant. Prefer concise,
33
+ correct answers.
34
+ - role: user
35
+ content: Explain the difference between BFS and DFS to a new CS student.
36
+ model-index:
37
+ - name: Hermes-4.3-ByteDance-Seed-36B
38
+ results: []
39
+ ---
40
+ # This is a decensored version of [NousResearch/Hermes-4.3-36B](https://huggingface.co/NousResearch/Hermes-4.3-36B), made using [Heretic](https://github.com/p-e-w/heretic) v1.1.0
41
+
42
+ ## Abliteration parameters
43
+
44
+ | Parameter | Value |
45
+ | :-------- | :---: |
46
+ | **direction_index** | 49.93 |
47
+ | **attn.o_proj.max_weight** | 1.26 |
48
+ | **attn.o_proj.max_weight_position** | 53.05 |
49
+ | **attn.o_proj.min_weight** | 1.01 |
50
+ | **attn.o_proj.min_weight_distance** | 26.38 |
51
+ | **mlp.down_proj.max_weight** | 1.45 |
52
+ | **mlp.down_proj.max_weight_position** | 38.47 |
53
+ | **mlp.down_proj.min_weight** | 0.44 |
54
+ | **mlp.down_proj.min_weight_distance** | 10.52 |
55
+
56
+ ## Performance
57
+
58
+ | Metric | This model | Original model ([NousResearch/Hermes-4.3-36B](https://huggingface.co/NousResearch/Hermes-4.3-36B)) |
59
+ | :----- | :--------: | :---------------------------: |
60
+ | **KL divergence** | 0.0157 | 0 *(by definition)* |
61
+ | **Refusals** | 6/100 | 81/100 |
62
+
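For readers unfamiliar with the KL divergence figure in the table above, here is a minimal sketch of how such a value can be computed from next-token logits. The logits are invented for illustration, and the direction KL(original || modified) is an assumption; Heretic's actual evaluation procedure may differ.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token logits over a 4-token vocabulary (not real model output).
original_logits = [2.0, 1.0, 0.5, -1.0]
modified_logits = [1.9, 1.1, 0.5, -1.0]

p = softmax(original_logits)
q = softmax(modified_logits)

print(f"KL(original || modified) = {kl_divergence(p, q):.4f}")
# A model compared with itself gives exactly zero, hence "0 (by definition)".
print(f"KL(original || original) = {kl_divergence(p, p):.4f}")
```

A small value means the modified model's output distribution stays close to the original's on the prompts that were measured.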
63
+ ## Usage
64
+ ```python
65
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
66
+ import torch
67
+
68
+ model_id = "arnomatic/Hermes-4.3-36B-heretic"
69
+
70
+ bnb_config = BitsAndBytesConfig(
71
+ load_in_4bit=True,
72
+ bnb_4bit_quant_type="nf4",
73
+ bnb_4bit_compute_dtype=torch.float16,
74
+ bnb_4bit_use_double_quant=True
75
+ )
76
+
77
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
78
+ model = AutoModelForCausalLM.from_pretrained(
79
+ model_id,
80
+ quantization_config=bnb_config,
81
+ device_map="auto"
82
+ )
83
+
84
+ messages = [
85
+ {"role":"system","content":"You are Hermes 4. Be concise."},
86
+ {"role":"user","content":"Summarize CRISPR in 3 sentences."}
87
+ ]
88
+
89
+ # FIX 1: return_dict=True makes apply_chat_template return a dictionary,
90
+ # so that **inputs can be unpacked in generate() below.
91
+ inputs = tokenizer.apply_chat_template(
92
+ messages,
93
+ add_generation_prompt=True,
94
+ return_tensors="pt",
95
+ return_dict=True
96
+ ).to(model.device)
97
+
98
+ # FIX 2: remove token_type_ids; Llama/Hermes models do not use them
99
+ if 'token_type_ids' in inputs:
100
+ del inputs['token_type_ids']
101
+
102
+ # FIX 3: define stop tokens for the Llama-3/Hermes chat format
103
+ terminators = [
104
+ tokenizer.eos_token_id,
105
+ tokenizer.convert_tokens_to_ids("<|eot_id|>")
106
+ ]
107
+
108
+ # **inputs now works because inputs is a dict (input_ids + attention_mask)
109
+ outputs = model.generate(
110
+ **inputs,
111
+ max_new_tokens=400,
112
+ temperature=0.6,
113
+ top_p=0.95,
114
+ top_k=20,
115
+ do_sample=True,
116
+ eos_token_id=terminators,
117
+ pad_token_id=tokenizer.eos_token_id
118
+ )
119
+
120
+ # Decode only the newly generated tokens
121
+ response = outputs[0][inputs['input_ids'].shape[-1]:]
122
+ print(tokenizer.decode(response, skip_special_tokens=True))
123
+ ```
124
+ Example output:
125
+
126
+ "CRISPR is a revolutionary gene-editing technology that allows scientists to make precise changes to DNA sequences. It uses a bacterial defense mechanism to target and cut specific genes, enabling researchers to add, remove, or alter genetic material with unprecedented accuracy. This powerful tool has the potential to revolutionize medicine, agriculture, and biotechnology by enabling the development of new treatments for genetic diseases, the creation of more resilient crops, and the engineering of organisms with desired traits"
127
+
128
+ -----
129
+
130
+
131
+ # Hermes 4.3 - Seed 36B
132
+
133
+ ![4.3 small](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/RoJnKcyjYfd5zj3PP3eRq.png)
134
+
135
+ ## Model Description
136
+
137
+ Hermes 4.3 36B is a frontier, hybrid-mode **reasoning** model made by Nous Research, based on the ByteDance Seed 36B base model and aligned to **you**.
138
+
139
+ This is our first Hermes model trained in a decentralized manner over the internet using [Psyche](https://psyche.network). Read the blog post: https://nousresearch.com/introducing-hermes-4-3/
140
+
141
+ Read the Hermes 4 technical report here: <a href="https://arxiv.org/abs/2508.18255">Hermes 4 Technical Report</a>
142
+
143
+ Chat with Hermes in Nous Chat: https://chat.nousresearch.com
144
+
145
+ Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.
146
+
147
+ ## What’s new vs Hermes 3
148
+
149
+ - **Post-training corpus**: Massively increased dataset size from 1M samples and 1.2B tokens to **~5M samples / ~60B tokens** blended across reasoning and non-reasoning data.
150
+ - **Hybrid reasoning mode** with explicit `<think>…</think>` segments when the model decides to deliberate, and options to make your responses faster when you want.
151
+ - **Reasoning** that is top quality, expressive, improves math, code, STEM, logic, and even creative writing and subjective responses.
152
+ - **Schema adherence & structured outputs**: trained to produce valid JSON for given schemas and to repair malformed objects.
153
+ - **Much easier to steer and align**: extreme improvements on steerability, especially on reduced refusal rates.
154
+
155
+ ## Our Mission: Frontier Capabilities Aligned to You
156
+
157
+ In pursuit of our mission to produce models that are open, steerable, and capable of the full range of human expression, while remaining alignable to your values, we created a new benchmark, RefusalBench, that tests the model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models.
158
+
159
+ **Hermes 4.3 36B is now SOTA across non-abliterated models on the RefusalBench Leaderboard, surpassing our previous best of 59.5% on Hermes 4 70B**
160
+ ### % of Questions Answered – RefusalBench
161
+ *(Average of 5 trials)*
162
+ | Model | % of Questions Answered |
163
+ | :--- | :--- |
164
+ | **Hermes 4.3 36B Non-Reasoning** | 74.60% |
165
+ | **Hermes 4.3 36B Reasoning** | 72.29% |
166
+ | Hermes 4 70B Reasoning | 59.50% |
167
+ | Hermes 4 405B Reasoning | 57.10% |
168
+ | grok4 | 51.30% |
169
+ | Hermes 4 70B | 49.07% |
170
+ | Hermes 4 405B | 43.20% |
171
+ | Qwen2.5 7B | 36.10% |
172
+ | Qwen3 235B Reasoning | 34.30% |
173
+ | DeepSeek V3 | 28.10% |
174
+ | Gemini 2.5 Pro | 24.23% |
175
+ | Llama 405B | 21.70% |
176
+ | Gemini 2.5 Flash | 19.13% |
177
+ | GPT4o | 17.67% |
178
+ | Sonnet 4 | 17.00% |
179
+ | GPT4-mini | 16.76% |
180
+ | R1 | 16.70% |
181
+ | cogito-v2-405B Reasoning| 15.40% |
182
+ | Opus 4.1 | 15.38% |
183
+ | Qwen3 235B | 15.30% |
184
+ | cogito-v2-405B | 14.94% |
185
+ | cogito-v2-405B | 12.10% |
186
+ | GPT 5 | 11.34% |
187
+ | gpt-oss 120B | 5.60% |
188
+ | gpt-oss 20B | 4.79% |
189
+
190
+ Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.
191
+
192
+ ## Benchmarks (Hermes 4.3 36B)
193
+
194
+ | | Hermes 4.3 36B Psyche | Hermes 4.3 36B Centralized | Hermes 4 70B Centralized |
195
+ | :--- | :--- | :--- | :--- |
196
+ | AIME 24 | 71.9 | 70.6 | 73.5 |
197
+ | AIME 25 | 69.3 | 66.8 | 67.4 |
198
+ | BBH | 86.4 | 84.7 | 87.8 |
199
+ | DROP | 83.5 | 81.6 | 85.0 |
200
+ | GPQA Diamond | 65.5 | 64.8 | 66.1 |
201
+ | IFEval | 77.9 | 73.9 | 78.7 |
202
+ | MATH-500 | 93.8 | 92.3 | 95.5 |
203
+ | MMLU | 87.7 | 86.5 | 88.4 |
204
+ | MMLU-Pro | 80.7 | 79.7 | 80.7 |
205
+ | MuSR | 69.7 | 64.7 | 70.4 |
206
+ | OBQA | 96.6 | 91.8 | 94.8 |
207
+ | SimpleQA | 6.0 | 5.6 | 17.9 |
208
+
209
+ ## Prompt Format
210
+
211
+ Hermes 4 uses Llama-3-Chat format with role headers and special tags.
212
+
213
+ **Basic chat:**
214
+ ```
215
+ <|start_header_id|>system<|end_header_id|>
216
+
217
+ You are Hermes 4. Be concise and helpful.<|eot_id|>
218
+ <|start_header_id|>user<|end_header_id|>
219
+
220
+ Explain the photoelectric effect simply.<|eot_id|>
221
+ <|start_header_id|>assistant<|end_header_id|>
222
+ ```
223
+
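The layout shown above can be sketched as a small string builder. This is an illustration only: `build_prompt` is a hypothetical helper that mirrors the example as printed in this README, and the model's real chat template may add a beginning-of-text token or use different whitespace. For actual inference, `tokenizer.apply_chat_template(...)` remains the reliable path.

```python
def build_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the role-header layout shown in this README.

    Hypothetical helper for illustration; it does not replace the tokenizer's
    own chat template.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>\n"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Hermes 4. Be concise and helpful."},
    {"role": "user", "content": "Explain the photoelectric effect simply."},
]

prompt = build_prompt(messages)
print(prompt)
```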
224
+ ### Reasoning mode
225
+
226
+ Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using the following system prompt:
227
+
228
+ ```
229
+ You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
230
+ ```
231
+
232
+ Note that you can add any additional system instructions before or after this system message, and they will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool definition system message with the reasoning one.
233
+
234
+ When the model chooses to deliberate, it emits:
235
+
236
+ ```
237
+ <|start_header_id|>assistant<|end_header_id|>
238
+ <think>
239
+ …model’s internal reasoning may appear here…
240
+ </think>
241
+ Final response starts here…<|eot_id|>
242
+ ```
243
+
244
+ Additionally, we provide a flag to keep the content in between the `<think> ... </think>` tags, which you can enable by setting `keep_cots=True`.
245
+
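When post-processing generated text yourself, the reasoning segment can be separated from the final response. The following is a minimal sketch assuming at most one `<think>…</think>` block at the start of the assistant turn; `split_reasoning` is a hypothetical helper, not part of any library.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


sample = "<think>\n2 + 2 = 4\n</think>\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Note that decoding with `skip_special_tokens=True` may strip the tags if they are registered as special tokens in the tokenizer, in which case this split has to happen on text decoded with special tokens kept.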
246
+
247
+ ## Function Calling & Tool Use
248
+
249
+ Hermes 4 supports function/tool calls *within* a single assistant turn, produced after its reasoning:
250
+
251
+ **System message (example):**
252
+
253
+ ```
254
+ <|start_header_id|>system<|end_header_id|>
255
+ You are a function-calling AI. Tools are provided inside <tools>…</tools>.
256
+ When appropriate, call a tool by emitting a <tool_call>{...}</tool_call> object.
257
+ After a tool responds (as <tool_response>), continue reasoning inside <think> and produce the final answer.
258
+ <tools>
259
+ {"type":"function","function":{"name":"get_weather","description":"Get weather by city","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}
260
+ </tools><|eot_id|>
261
+ ```
262
+
263
+ Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use.
264
+
265
+ The model will then generate tool calls within `<tool_call> {tool_call} </tool_call>` tags for easy parsing. The tool_call tags are also added tokens, which makes them easy to parse while streaming. There are also automatic tool parsers for Hermes built into vLLM and SGLang: set the tool parser to `hermes` in vLLM and to `qwen25` in SGLang.
266
+
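If you are not using a serving engine's built-in parser, tool calls can be extracted from the generated text by hand. The sketch below is illustrative: `extract_tool_calls` is a hypothetical helper, and the `{"name": ..., "arguments": ...}` payload shape is an assumption about the JSON the model emits, so check it against real output before relying on it.

```python
import json
import re

# Capture whatever sits between the tags, tolerating surrounding whitespace.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return every tool-call payload in text that parses as valid JSON."""
    calls = []
    for raw in TOOL_CALL_PATTERN.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip malformed payloads instead of failing the whole turn.
            continue
    return calls


# Assumed output shape, reusing the get_weather tool from the example above.
sample = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Berlin"}}\n'
    "</tool_call>"
)

for call in extract_tool_calls(sample):
    print(call["name"], call["arguments"])
```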
267
+ ## Inference Notes
268
+
269
+ - **Sampling defaults that work well:** `temperature=0.6, top_p=0.95, top_k=20`.
270
+ - **Template:** Use the Llama chat format for Hermes 4.3 36B, 70B, and 405B as shown above, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.
271
+
272
+ ### Transformers example
273
+
274
+ ```python
275
+ from transformers import AutoTokenizer, AutoModelForCausalLM
276
+ import torch
277
+
278
+ model_id = "NousResearch/Hermes-4.3-36B"
279
+
280
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
281
+ model = AutoModelForCausalLM.from_pretrained(
282
+ model_id,
283
+ torch_dtype=torch.float16,
284
+ device_map="auto"
285
+ )
286
+
287
+ messages = [
288
+ {"role":"system","content":"You are Hermes 4. Be concise."},
289
+ {"role":"user","content":"Summarize CRISPR in 3 sentences."}
290
+ ]
291
+
292
+ inputs = tokenizer.apply_chat_template(
293
+ messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
294
+ ).to(model.device)
295
+
296
+ outputs = model.generate(
297
+ **inputs, max_new_tokens=400, temperature=0.6, top_p=0.95, top_k=20, do_sample=True
298
+ )
299
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
300
+ ```
301
+
302
+ For production serving on multi-GPU nodes, consider tensor parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.
303
+
304
+ ## Inference Providers:
305
+
306
+ ### Nous Portal:
307
+
308
+ <a href="https://portal.nousresearch.com"><img width=256 alt="Nous Portal logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/6YytY7N0mjCnBQvWo3qtv.png"></a>
309
+
310
+ ### Chutes:
311
+
312
+ <a href="https://chutes.ai/app"><img width=256 alt="chutes logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/l14AWPv6cSvaprpwK_IWY.png"></a>
313
+
314
+ # Quantized / Smaller Variants
315
+
316
+ Hermes 4 is available as BF16 original weights as well as GGUF variants.
317
+
318
+ GGUF versions (4, 5, 6, and 8-bit): https://huggingface.co/NousResearch/Hermes-4.3-36B-GGUF
319
+
320
+ See the Hermes 4 collection to explore them all:
321
+ https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
322
+
323
+ # How to cite
324
+
325
+ ```bibtex
326
+ @misc{teknium2025hermes4technicalreport,
327
+ title={Hermes 4 Technical Report},
328
+ author={Ryan Teknium and Roger Jin and Jai Suphavadeeprasit and Dakota Mahan and Jeffrey Quesnelle and Joe Li and Chen Guang and Shannon Sands and Karan Malhotra},
329
+ year={2025},
330
+ eprint={2508.18255},
331
+ archivePrefix={arXiv},
332
+ primaryClass={cs.AI},
333
+ url={https://arxiv.org/abs/2508.18255},
334
+ }
335
  ```