Osaurus-AI committed on
Commit 7e4428c · verified · 1 Parent(s): f18c97b

Upload README.md with huggingface_hub

  1. README.md +4 -4
README.md CHANGED
@@ -88,14 +88,14 @@ out = generate(model, tokenizer, prompt=prompt, max_tokens=4096,
  ### HumanEval+ (code generation)
 
  - **Dataset**: `evalplus/humanevalplus` test split (164 prompts, harder tests than HumanEval).
- - **Protocol**: greedy pass@1 baseline + pass@5 retry on failures.
- - **Sampling for pass@5**: temp=1.0, top_p=0.95, top_k=40 (MiniMax official); k=5 samples per failed problem, early stop on first pass.
+ - **Protocol**: sampled pass@1 baseline + pass@5 retry on failures.
+ - **Sampling for both pass@1 and pass@5 retry**: temp=1.0, top_p=0.95, top_k=40 (MiniMax official); max_tokens=5000 on pass@1, 1200 on pass@5; k=5 samples per failed problem, early stop on first pass.
  - **Grading**: each candidate run with 20s subprocess timeout; must pass ALL EvalPlus tests.
 
  | Metric | Score |
  |--------|-------|
- | **pass@1 (greedy)** | **71.95%** (118/164) |
- | **pass@5 (greedy + sampled retry)** | **89.02%** (146/164) |
+ | **pass@1 (sampled, temp=1.0)** | **71.95%** (118/164) |
+ | **pass@5 (sampled, retry of failures)** | **89.02%** (146/164) |
 
  28 of 46 greedy failures recovered via sampling; 18 residual failures are genuine logic errors or prompts where 1200 tokens ran out mid-reasoning.
 
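
The protocol in the README hunk above (one first attempt per problem, up to k=5 retries on failures with early stop, a 20s subprocess timeout per candidate) could be sketched roughly as follows. This is a minimal illustration, not the script used for the reported numbers: `passes_tests`, `evaluate_with_retry`, the `first_attempt` / `retry_sample` callables, and the `{"prompt", "tests"}` problem layout are all hypothetical names chosen for this example, and a bare subprocess is not a sandbox.

```python
import subprocess
import sys
from typing import Callable, Dict


def passes_tests(candidate: str, tests: str, timeout_s: float = 20.0) -> bool:
    """Run candidate code plus its tests in a fresh interpreter.

    Returns True only if the process exits with status 0 within the
    timeout, i.e. every assertion in `tests` held.
    """
    program = candidate + "\n\n" + tests + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A candidate that hangs counts as a failure.
        return False
    return result.returncode == 0


def evaluate_with_retry(
    problems: Dict[str, Dict[str, str]],
    first_attempt: Callable[[str], str],  # hypothetical: one generation per prompt
    retry_sample: Callable[[str], str],   # hypothetical: one sampled retry per call
    k: int = 5,
) -> Dict[str, float]:
    """Score a first attempt, then retry only the failed problems up to k times."""
    solved_first = set()
    recovered = set()
    for task_id, problem in problems.items():
        if passes_tests(first_attempt(problem["prompt"]), problem["tests"]):
            solved_first.add(task_id)
            continue
        for _ in range(k):
            if passes_tests(retry_sample(problem["prompt"]), problem["tests"]):
                recovered.add(task_id)
                break  # early stop on first pass
    total = len(problems)
    return {
        "pass@1": len(solved_first) / total,
        "pass@5": (len(solved_first) + len(recovered)) / total,
    }
```

Under this reading, the table's figures are consistent with each other: 164 - 118 = 46 first-attempt failures, 28 of them recovered on retry, and 118 + 28 = 146, so 146/164 ≈ 89.02%. Note that a retry-on-failure score of this kind is a cumulative solve rate and is not the same quantity as the unbiased pass@k estimator computed from k independent samples per problem.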