EvalPlus for rigorous evaluation of LLM-synthesized code
Quick Start • LLM Code • Tools • Citation • Acknowledgement
EvalPlus is a rigorous evaluation framework for LLM4Code, with HumanEval+ and MBPP+ (extended versions of HumanEval and MBPP with many additional tests) and a toolchain to evaluate LLM-generated code on them easily and safely.

Why EvalPlus?
Want to know more details? Read our NeurIPS'23 paper as well as our Google Slides!
[!Important]
🚧 MBPP+ update (from v0.1.0 to v0.2.0): We recently improved and stabilized the MBPP+ dataset by removing some tasks whose `test_list` is wrong (inherited from the original MBPP dataset) to make them more reasonable to solve. In v0.1.0 MBPP+ has 399 tasks, while the new v0.2.0 has 378 tasks. We also improved the oracle. Therefore, using v0.2.0 you might expect a ~4pp pass@1 improvement for both base and plus tests.
[!Tip]
EvalPlus ❤️ bigcode-evaluation-harness! HumanEval+ and MBPP+ have been integrated into bigcode-evaluation-harness, so you can also run EvalPlus datasets there!
To get started, please first set up the environment. You can install the latest release from PyPI:

pip install evalplus --upgrade

...or install the latest code from the main branch:

pip install "git+https://github.com/evalplus/evalplus.git" --upgrade

...or clone the repository for a local setup:

git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
Implement the GEN_SOLUTION function by calling the LLM to produce the complete solution (including the code) and save the samples to samples.jsonl:
from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl
samples = [
dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
Each "problem" dict includes the following fields:

- `task_id` is the identifier string for the task
- `entry_point` is the name of the function
- `prompt` is the function signature with docstring
- `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
- `base_input` is the test inputs in the original HumanEval
- `plus_input` is the test inputs brought by EvalPlus

[!Note]
Expected schema of `samples.jsonl`:

- `task_id`: Task ID, which are the keys of `get_[human_eval|mbpp]_plus()`
- `solution` (optional): Self-contained solution (usually including the prompt)
  - Example: `{"task_id": "HumanEval/?", "solution": "def f():\n return 1"}`
- `completion` (optional): Function body without prompt
  - Example: `{"task_id": "HumanEval/?", "completion": " return 1"}`

Only one of `solution` and `completion` is required. If both are provided, `solution` will be used. We also accept solutions in the form of a directory, i.e., `--samples ${SAMPLE_DIR}`, where `${SAMPLE_DIR}` is organized as `${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py` (`${TASK_ID} = task_id.replace("/", "_")`).
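To see these fields concretely, you can load the dataset and inspect a problem; a quick sketch (exact input counts depend on the dataset version):

```python
from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()
task_id, problem = next(iter(problems.items()))  # e.g., "HumanEval/0"
print(task_id, problem["entry_point"])  # task ID and name of the function under test
print(problem["prompt"])                # signature + docstring given to the LLM
print(len(problem["base_input"]))       # number of original test inputs
print(len(problem["plus_input"]))       # number of extra EvalPlus test inputs
```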
LLM-generated text may not be compilable code, as it can include natural language lines or incomplete extra code. We provide a tool namely `evalplus.sanitize` to clean up the code:
# 💡 If you are storing codes in jsonl:
evalplus.sanitize --samples samples.jsonl
# Sanitized code will be produced to `samples-sanitized.jsonl`

# 💡 If you are storing codes in directories:
evalplus.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
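Conceptually, sanitization extracts the code portion from raw LLM output. The toy sketch below illustrates the idea with a naive fenced-code-block extractor; it is NOT EvalPlus's actual implementation, which is considerably more robust:

```python
import re

def naive_sanitize(text: str) -> str:
    """Toy illustration: keep the first fenced code block if one exists,
    otherwise return the text unchanged."""
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

raw = "Here is the solution:\n```python\ndef f():\n    return 1\n```\nHope this helps!"
print(naive_sanitize(raw))  # -> "def f():\n    return 1\n"
```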
To double-check the post-processing results, you can use evalplus.syncheck
to check the code validity before and after sanitization, which will print erroneous code snippets and why they are wrong:
# 💡 If you are storing codes in jsonl:
evalplus.syncheck --samples samples.jsonl --dataset [humaneval|mbpp]

# 💡 If you are storing codes in directories:
evalplus.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
We strongly recommend using a sandbox such as Docker:
docker run -v $(pwd):/app ganler/evalplus:latest --dataset [humaneval|mbpp] --samples samples.jsonl
...Or if you want to try it locally regardless of the risks ⚠️:
evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl
[!Tip]
Do you use a very slow machine?

LLM solutions are regarded as failed on timeout (and OOM, etc.). Specifically, we set the timeout $T = \max(T_{base}, T_{gt} \times k)$, where:

- $T_{base}$ is the minimal timeout (configurable via `--min-time-limit`; defaults to 1s);
- $T_{gt}$ is the runtime of the ground-truth solutions (obtained via profiling);
- $k$ is a configurable factor `--gt-time-limit-factor` (defaults to 4).

If your machine is too slow and you are getting high-variance results, try using larger $k$ and $T_{base}$.
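As a quick worked example with the defaults ($T_{base} = 1\text{s}$, $k = 4$): a task whose ground truth runs in $T_{gt} = 0.1\text{s}$ gets a timeout of $\max(1, 0.1 \times 4) = 1\text{s}$, while one with $T_{gt} = 0.5\text{s}$ gets $\max(1, 0.5 \times 4) = 2\text{s}$.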
Additionally, you are NOT encouraged to overstress your test-bed while running the evaluation. For example, using `--parallel 64` on a 4-core machine or doing something else during evaluation are bad ideas...
If you are using the cloned repository instead of the pip package, run the evaluator as:

export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
Useful flags:

- `--parallel`: by default half of the cores
- `--base-only` (store_true): only run base HumanEval tests
- `--i-just-wanna-run`: force a re-run

The output should be like (below is a GPT-4 greedy decoding example):
Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
- `Base` is the `pass@k` for the original HumanEval
- `Base + Extra` is the `pass@k` for our HumanEval+ (with extra tests)
- `pass@k` is computed for k in `[1, 10, 100]`, where only k values <= the sample size are used
- `samples_eval_results.jsonl` will be cached. Remove it to re-run the evaluation

If you do greedy decoding where there is only one sample for each task, the evaluation should take just a few seconds.
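For reference, `pass@k` is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021); here is a minimal sketch of that formula (EvalPlus's internal implementation may differ in details):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = total samples per task,
    c = number of correct samples, k = the k in pass@k."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(200, 150, 1))  # e.g., 150/200 correct -> pass@1 = 0.75
```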
When running 200 samples x 164 tasks x ~700+ tests, it can take around 2-10 minutes with `--parallel 64` and `--test-details`.
Here are some tips to speed up the evaluation:

- Use `--parallel $(nproc)`.
- Do NOT set `--test-details` if you just want to quickly get pass@k, as `--test-details` will run all tests (700+ on average for each task), while without `--test-details` the testing for a sample stops immediately when it fails the first test.

[!Tip]
Try out `HumanEvalPlus-Mini`! It selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add a `--mini` flag; it can run 23+% faster! (even faster if you evaluate all tests without fail-stop with `--test-details`).

docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
# ...Or locally ⚠️
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini
We also share pre-generated code samples from LLMs we have evaluated:
Each sample file is packaged in a zip file named like `${model_name}_temp_${temperature}.zip`.
You can unzip them to a folder named like ${model_name}_temp_${temperature}
and run the evaluation from scratch with:
evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}
To use these tools, please first install the repository from GitHub:
git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r tools/requirements.txt
We have configured code generation for a wide range of LLMs (see supported models in codegen/models.py). For example, to run greedy generation on StarCoderBase-7B:
python codegen/generate.py --model starcoderbase-7b --bs 1 --temperature 0 --n_samples 1 --resume --greedy --root [result_path] --dataset [mbpp|humaneval]
Please check `evalplus/inputgen.py`.
@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}