Upload README.md with huggingface_hub
README.md CHANGED
@@ -88,14 +88,14 @@ out = generate(model, tokenizer, prompt=prompt, max_tokens=4096,
 ### HumanEval+ (code generation)
 
 - **Dataset**: `evalplus/humanevalplus` test split (164 prompts, harder tests than HumanEval).
-- **Protocol**:
-- **Sampling for pass@5**: temp=1.0, top_p=0.95, top_k=40 (MiniMax official); k=5 samples per failed problem, early stop on first pass.
+- **Protocol**: sampled pass@1 baseline + pass@5 retry on failures.
+- **Sampling for both pass@1 and pass@5 retry**: temp=1.0, top_p=0.95, top_k=40 (MiniMax official); max_tokens=5000 on pass@1, 1200 on pass@5; k=5 samples per failed problem, early stop on first pass.
 - **Grading**: each candidate run with 20s subprocess timeout; must pass ALL EvalPlus tests.
 
 | Metric | Score |
 |--------|-------|
-| **pass@1 (
-| **pass@5 (
+| **pass@1 (sampled, temp=1.0)** | **71.95%** (118/164) |
+| **pass@5 (sampled, retry of failures)** | **89.02%** (146/164) |
 
 28 of 46 greedy failures recovered via sampling; 18 residual failures are genuine logic errors or prompts where 1200 tokens ran out mid-reasoning.
 
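For concreteness, here is a minimal sketch of the retry protocol the updated section describes. It is not the actual harness: it assumes a recent `mlx-lm` (where sampling parameters are configured via `make_sampler`), and the `problems` dict schema, the `sample_completion` wrapper, and the inline subprocess grader are illustrative stand-ins. Prompt formatting and extraction of code from the model output are elided.

```python
import subprocess
import sys
import tempfile

from mlx_lm import generate  # model, tokenizer come from mlx_lm.load(...)
from mlx_lm.sample_utils import make_sampler

# MiniMax-official sampling settings quoted in the section above.
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)


def sample_completion(model, tokenizer, prompt: str, max_tokens: int) -> str:
    """One sampled completion: max_tokens=5000 for pass@1, 1200 for retries."""
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens,
                    sampler=sampler)


def grade(candidate: str, test_code: str, timeout_s: float = 20.0) -> bool:
    """Run one candidate plus its EvalPlus tests in a fresh subprocess.

    Pass = exit code 0 before the 20 s timeout; any assertion failure,
    crash, or hang counts as a fail.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True,
                                timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def evaluate(model, tokenizer, problems: dict) -> tuple[float, float]:
    """problems: task_id -> {"prompt": str, "tests": str} (illustrative schema)."""
    pass1, pass5 = set(), set()
    for pid, prob in problems.items():
        # pass@1 baseline: one sample with the long 5000-token budget.
        if grade(sample_completion(model, tokenizer, prob["prompt"], 5000),
                 prob["tests"]):
            pass1.add(pid)
            pass5.add(pid)
            continue
        # pass@5 retry: up to k=5 short (1200-token) samples per failed
        # problem, early stop on the first pass.
        for _ in range(5):
            if grade(sample_completion(model, tokenizer, prob["prompt"], 1200),
                     prob["tests"]):
                pass5.add(pid)
                break
    n = len(problems)
    return len(pass1) / n, len(pass5) / n
```

Running candidates in a subprocess is what makes the 20s timeout enforceable: an infinite loop in generated code kills only that child process, not the harness. The reported numbers are internally consistent under this scheme: 118/164 ≈ 71.95% at pass@1, and recovering 28 of the 46 failures gives 146/164 ≈ 89.02% at pass@5.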