tperes committed on
Commit 63b727b · verified · Parent(s): 1b1da24

Update README.md

Files changed (1): README.md +148 -42
README.md CHANGED
@@ -177,17 +177,27 @@ The model uses the standard ChatML format:

 Apache 2.0

 ---

- # Original model Card: palmyra-mini-thinking-b

- ## Model Details

- **Model Name:** palmyra-mini-thinking-b

- **Version:** 1.0

- **Type:** Generative AI Language Model

 ## Introduction

@@ -201,44 +211,140 @@ The model's mathematical abilities are particularly noteworthy. It achieves an i

 Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.

- ## Benchmark Scores
-
- | Benchmark | Score |
- |:-----------------------------------------------------------------|---------:|
- | gsm8k (strict-match) | 0.4268 |
- | minerva_math (exact_match) | 0.0708 |
- | mmlu_pro (exact_match) | 0.2926 |
- | hendrycks_math | 0.0016 |
- | ifeval (inst_level_loose_acc) | 0.3297 |
- | mathqa (acc) | 0.3045 |
- | humaneval (pass@1) | 0.0732 |
- | BBH (get-answer)(exact_match) | 0.288 |
- | mbpp | 0.168 |
- | leadboard_musr (acc_norm) | 0.3796 |
- | gpqa lighteval gpqa diamond_pass@1:8_samples | 0.3958 |
- | AIME24 (pass@1)(avg-of-1) | 0.6 |
- | AIME25 (pass@1)(avg-of-1) | 0.5 |
- | Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 0.2873 |
- | AMC23 | 0.925 |
- | MATH500 | 0.882 |
- | Minerva | 0.2941 |
- | Olympiadbench (extractive_match) | 0.5733 |
- | Codecontests (pass_rate) | 0.2018 |
- | Codeforces (pass_rate) | 0.6343 |
- | Taco (pass_rate) | 0.3456 |
- | APPS (all_levels) | 0.0584 |
- | HMMT23 (extractive_match) | 0.2333 |
- | Average | 0.359378 |
-
- ## Intended Use
-
- This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.
-
- ## Limitations
-
- The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.

 ## Ethical Considerations

 As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.

 Apache 2.0

+ #### Original model card below:
+
 ---

+ <div align="center">
+ <h1>Palmyra-mini-thinking-b</h1>
+ </div>

+ <p align="center">
+ <img src="https://huggingface.co/Writer/palmyra-mini-thinking-b/resolve/main/logo-mini-b%20benchmark-performance.png?download=true" width="800"/>
+ </p>

+ ### Model Description

+ - **Language(s) (NLP):** English
+ - **License:** Apache-2.0
+ - **Finetuned from model:** Qwen/Qwen2.5-1.5B
+ - **Context window:** 131,072 tokens
+ - **Parameters:** 1.7 billion

 ## Introduction

 Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.

+ ## Benchmark Scores (sampling params: temperature: 0.6, top_p: 0.95)
+
+ Pass@1 (avg-of-64)
+
+ | Benchmark | Pass@1 (avg-of-64) | Majority@64 |
+ | :-------- | :----------------- | :---------- |
+ | AIME24 | 59.43% | 71.67% |
+ | AIME25 | 49.69% | 60.00% |
+ | GPQA | 42.01% | 47.22% |
+ | HMMT25 | 27.86% | 30.00% |
+ | HLE | 5.22% | N/A |
+ | MMLU-PRO | 55.49% | 60.60% |
+ | MATH500 | 93.80% | 95.40% |
+ | LCB | 34.51% | N/A |
+
+ LCB here refers to LiveCodeBench, version v6_2408_2505.
+
+ Pass@1 (avg-of-1)
+
+ | Benchmark | Score (%) |
+ |:-----------------------------------------------------------------|----------:|
+ | GSM8K (strict-match) | 42.68% |
+ | Minerva Math (exact match) | 7.08% |
+ | MMLU-PRO (exact match) | 29.26% |
+ | MATH (Hendrycks) | 0.16% |
+ | IFEval (inst_level_loose_acc) | 32.97% |
+ | MathQA (acc) | 30.45% |
+ | HumanEval (pass@1) | 7.32% |
+ | BBH (get-answer)(exact match) | 28.80% |
+ | MBPP | 16.80% |
+ | GPQA (diamond, pass@1: 8 samples) | 39.58% |
+ | AIME24 (pass@1)(avg-of-1) | 60.00% |
+ | AIME25 (pass@1)(avg-of-1) | 50.00% |
+ | Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 28.73% |
+ | AMC23 | 92.50% |
+ | MATH500 | 88.20% |
+ | Minerva | 29.41% |
+ | Olympiadbench (extractive_match) | 57.33% |
+ | Codecontests (pass_rate) | 20.18% |
+ | Codeforces (pass_rate) | 63.43% |
+ | Taco (pass_rate) | 34.56% |
+ | APPS (all_levels) | 5.84% |
+ | HMMT (Feb 2025) (extractive_match) | 23.33% |
+ | Average | 35.94% |
+
+ ### Use with transformers
+
+ You can run conversational inference using the Transformers Auto classes with the `generate()` function. Here's an example:
+
+ ```py
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "Writer/palmyra-mini-thinking-b"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     device_map="auto",
+     attn_implementation="flash_attention_2",  # requires the flash-attn package
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?"
+     }
+ ]
+
+ input_ids = tokenizer.apply_chat_template(
+     messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+ )
+
+ gen_conf = {
+     "max_new_tokens": 256,
+     "eos_token_id": tokenizer.eos_token_id,
+     "temperature": 0.3,
+     "top_p": 0.9,
+ }
+
+ with torch.inference_mode():
+     output_id = model.generate(input_ids, **gen_conf)
+
+ # Decode only the newly generated tokens, skipping the prompt.
+ output_text = tokenizer.decode(output_id[0][input_ids.shape[1]:])
+
+ print(output_text)
+ ```
+
+ ## Running with vLLM
+
+ ```sh
+ vllm serve Writer/palmyra-mini-thinking-b
+ ```
+
+ ```sh
+ curl -X POST http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Writer/palmyra-mini-thinking-b",
+     "messages": [
+       {
+         "role": "user",
+         "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?"
+       }
+     ],
+     "max_tokens": 8000,
+     "temperature": 0.2
+   }'
+ ```
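Because the vLLM server exposes an OpenAI-compatible endpoint, the same request can also be issued from Python. A minimal sketch using only the standard library; it assumes the `vllm serve` command above is already running on `localhost:8000`, so the actual network call is left commented out:

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "Writer/palmyra-mini-thinking-b",
    "messages": [
        {
            "role": "user",
            "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?",
        }
    ],
    "max_tokens": 8000,
    "temperature": 0.2,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```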

 ## Ethical Considerations

 As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.
+
+ ### Footnotes
+
+ - Base model: This model builds on NVIDIA's OpenReasoning-Nemotron-1.5B (`https://huggingface.co/nvidia/OpenReasoning-Nemotron-1.5B`).
+ - Evaluation methodology:
+   - Pass@1 (avg-of-1): computed using `lm_eval` and `lighteval`.
+   - Pass@1 (avg-of-64) and Majority@64: computed using `nemoskills`.
+
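For reference, the two aggregate metrics above can be computed from per-sample correctness judgments roughly as follows. This is an illustrative sketch with made-up toy data, not the actual `nemoskills` implementation:

```python
from collections import Counter

def pass_at_1_avg_of_k(correct_flags):
    """Pass@1 (avg-of-k): fraction of the k sampled answers that are
    correct, averaged over all problems."""
    return sum(sum(f) / len(f) for f in correct_flags) / len(correct_flags)

def majority_at_k(answers, references):
    """Majority@k: a problem counts as solved if the most frequent of
    its k sampled answers matches the reference answer."""
    solved = 0
    for ans, ref in zip(answers, references):
        majority, _ = Counter(ans).most_common(1)[0]
        solved += majority == ref
    return solved / len(references)

# Toy data: 2 problems, 4 samples each (the real evaluation uses 64 samples).
answers = [["4", "4", "6", "4"], ["7", "9", "9", "9"]]
references = ["4", "9"]
correct = [[a == r for a in ans] for ans, r in zip(answers, references)]

print(pass_at_1_avg_of_k(correct))        # (3/4 + 3/4) / 2 = 0.75
print(majority_at_k(answers, references)) # both majorities match -> 1.0
```

Majority voting rewards self-consistency, which is why Majority@64 exceeds Pass@1 (avg-of-64) on every benchmark where it is reported.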
+ ### Citation and Related Information
+
+ To cite this model:
+
+ ```
+ @misc{Palmyra-mini-thinking-b,
+   author = {Writer Engineering team},
+   title = {{Palmyra-mini: A powerful LLM designed for math and coding}},
+   howpublished = {\url{https://dev.writer.com}},
+   year = 2025,
+   month = sep
+ }
+ ```
+
+ Contact: Hello@writer.com