Osaurus-AI committed · Commit 5776428 · verified · Parent: 0b9d135

fix(eval): correct pass@1=81.10/pass@5=90.24 after extractor bug fix

Files changed (1): README.md (+4 -3)
README.md CHANGED
@@ -91,13 +91,14 @@ out = generate(model, tokenizer, prompt=prompt, max_tokens=4096,
 - **Protocol**: sampled pass@1 baseline + pass@5 retry on failures.
 - **Sampling for both pass@1 and pass@5 retry**: temp=1.0, top_p=0.95, top_k=40 (MiniMax official); max_tokens=5000 on pass@1, 1200 on pass@5; k=5 samples per failed problem, early stop on first pass.
 - **Grading**: each candidate run with 20s subprocess timeout; must pass ALL EvalPlus tests.
+- **Extractor**: `jang_tools.kimi_prune.bench_humaneval._extract_code` (≥ 2026-04-24). The earlier extractor mis-paired markdown fences when the model emitted token-boundary glitches at the language tag (e.g. `\`\`\`python一致:`, `\`\`\`pythonfr`) and when the chat template prefilled `<think>` at the prompt boundary, costing roughly nine points of pass@1.

 | Metric | Score |
 |--------|-------|
-| **pass@1 (sampled, temp=1.0)** | **71.95%** (118/164) |
-| **pass@5 (sampled, retry of failures)** | **89.02%** (146/164) |
+| **pass@1 (sampled, temp=1.0)** | **81.10%** (133/164) |
+| **pass@5 (sampled, retry of failures)** | **90.24%** (148/164) |

-28 of 46 greedy failures recovered via sampling; 18 residual failures are genuine logic errors or prompts where 1200 tokens ran out mid-reasoning.
+After the extractor fix, 30 of 46 originally-counted pass@1 failures resolve cleanly: 15 were correct answers eaten by fence-pairing, and another 15 recover under pass@5 sampling. The 16 residuals split into ~8 token-budget starvations (`no_code_block`), ~5 in-code 2-bit token-boundary glitches (`return False言`, `Nonef`, etc.), and ~3 genuine logic errors on EvalPlus hidden tests.

 ## Variants
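
The fence-pairing failure mode described in the diff is mechanical enough to sketch. The following is a minimal, hypothetical re-implementation of a glitch-tolerant extractor (the real fix is `jang_tools.kimi_prune.bench_humaneval._extract_code`; its source is not part of this commit): strip any prefilled `<think>...</think>` block first, accept arbitrary junk after the opening fence's language tag, and pair each opening fence with the next closing fence that sits on its own line.

```python
import re

# Hypothetical sketch only; the actual fixed extractor is
# jang_tools.kimi_prune.bench_humaneval._extract_code.

# Reasoning block the chat template may prefill at the prompt boundary.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

# Opening fence: ``` plus anything else on that line, so glitched language
# tags such as ```python一致: or ```pythonfr still open a block. Closing
# fence: a line containing only ``` (optionally indented).
FENCE_RE = re.compile(r"```[^\n]*\n(.*?)\n[ \t]*```", re.DOTALL)

def extract_code(completion: str) -> str | None:
    text = THINK_RE.sub("", completion)   # drop reasoning before pairing fences
    blocks = FENCE_RE.findall(text)
    if not blocks:
        return None                       # the "no_code_block" failure mode
    return blocks[-1].strip()             # models usually restate the solution last
```

Pairing on a bare closing line rather than on the language tag is what makes the glitched tags harmless: the junk only ever appears on the opening line.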
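
For reference, the protocol bullets in the diff reduce to a short loop: one sampled pass@1 attempt per problem, then up to five retries on failures with early stop, each candidate graded in a subprocess under a 20 s timeout. This sketch is illustrative, not the repo's harness; `sample_completion` and the `problem.tests` attribute are assumed names.

```python
import os
import subprocess
import sys
import tempfile

def grade(candidate_code: str, test_code: str, timeout: float = 20.0) -> bool:
    """Run the candidate plus ALL EvalPlus tests; pass iff the subprocess exits 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hung or looping candidates count as failures
    finally:
        os.unlink(path)

def evaluate(problems, sample_completion):
    # sample_completion(problem, max_tokens) samples at temp=1.0,
    # top_p=0.95, top_k=40 and returns extracted code or None.
    passed1 = passed5 = 0
    for problem in problems:
        # pass@1 baseline: one sample with the 5000-token budget.
        code = sample_completion(problem, max_tokens=5000)
        if code and grade(code, problem.tests):
            passed1 += 1
            passed5 += 1
            continue
        # pass@5 retry on failures: k=5 samples at 1200 tokens,
        # early stop on the first candidate that passes.
        for _ in range(5):
            code = sample_completion(problem, max_tokens=1200)
            if code and grade(code, problem.tests):
                passed5 += 1
                break
    n = len(problems)
    return passed1 / n, passed5 / n  # e.g. 133/164 = 81.10%, 148/164 = 90.24%
```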