# EvalPlus for rigorous evaluation of LLM-synthesized code

## v0.2.2
- `evalplus.sanitize` to post-process LLM-generated code as much as possible
- `evalplus.syncheck` to check the Python compilability of LLM-generated code

PyPI: https://pypi.org/project/evalplus/0.2.2/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.2.2/images/sha256-b9b2055b8380a8cdddd71b4355c56c13fe37d930cd46d0815e140b7dbe045dd2
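Both are command-line tools; a minimal sketch of driving them on a generated sample file, assuming the `--samples` flag and the module entry points shown below (treat them as assumptions rather than the documented interface):

```python
import subprocess

# Post-process raw LLM outputs (e.g., strip markdown fences and trailing prose).
subprocess.run(
    ["python", "-m", "evalplus.sanitize", "--samples", "samples.jsonl"],
    check=True,
)

# Check that the (sanitized) solutions are at least syntactically valid Python.
subprocess.run(
    ["python", "-m", "evalplus.syncheck", "--samples", "samples.jsonl"],
    check=True,
)
```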
## v0.2.1

- `tools`: fix the `CACHE_DIR` nonexistence issue
- `eval_results.json` is formatted for readability
- `EVALPLUS_TIMEOUT_PER_TASK` env var to set the maximum testing time for each task
- `inputgen.py`: fixes the oracle of `HumanEval/32`

PyPI: https://pypi.org/project/evalplus/0.2.1/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.2.1/images/sha256-2bb315e40ea502b4f47ebf1f93561ef88280d251bdc6f394578c63d90e1825d7
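A sketch of bounding per-task testing time via the new variable; the value, its unit (assumed to be seconds), and the `evalplus.evaluate` invocation are assumptions here:

```python
import os
import subprocess

# Cap the testing time for each task; "60" is illustrative.
env = dict(os.environ, EVALPLUS_TIMEOUT_PER_TASK="60")
subprocess.run(
    ["python", "-m", "evalplus.evaluate",
     "--dataset", "humaneval", "--samples", "samples.jsonl"],
    check=True,
    env=env,
)
```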
## v0.2.0

MBPP is a dataset curated by Google. Its full set includes around 1,000 crowd-sourced Python programming problems. However, a certain number of the problems can be noisy (e.g., the prompt makes no sense or the tests are broken). Consequently, a subset (~427 problems) of the data has been hand-verified by the original authors -- `MBPP-sanitized`.

MBPP+ improves MBPP based on its sanitized version (`MBPP-sanitized`):
- `--dataset mbpp` for `evalplus.evaluate`, `codegen/generate.py`, `tools/checker.py`, as well as `tools/sanitize.py`
A typical workflow to use MBPP+:
```python
# Step 1: Generate MBPP solutions
from evalplus.data import get_mbpp_plus, write_jsonl

def GEN_SOLUTION(prompt: str) -> str:
    # The LLM produces the whole solution based on the prompt
    ...

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_mbpp_plus().items()
]
write_jsonl("samples.jsonl", samples)
# May perform some post-processing to sanitize LLM-produced code
# e.g., https://github.com/evalplus/evalplus/blob/master/tools/sanitize.py
```

```bash
# Step 2: Evaluation on MBPP+
docker run -v $(pwd):/app ganler/evalplus:latest --dataset mbpp --samples samples.jsonl
# STDOUT will display the scores for "base" (with MBPP tests) and "base + plus" (with additional MBPP+ tests)
```
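With a local `pip install evalplus` (see the v0.1.2 notes below), the same evaluation can presumably be run without Docker; a minimal sketch, assuming the `evalplus.evaluate` module accepts the same flags as the Docker entry point:

```python
import subprocess

# Same scores as the Docker invocation above, from a local installation.
subprocess.run(
    ["python", "-m", "evalplus.evaluate",
     "--dataset", "mbpp", "--samples", "samples.jsonl"],
    check=True,
)
```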
- HumanEval+ dataset updated to `v0.1.9` from `v0.1.6`

PyPI: https://pypi.org/project/evalplus/0.2.0/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.2.0/images/sha256-6f1b9bd13930abfb651a99d4c6a55273271f73e5b44c12dcd959a00828782dd6
## v0.1.7

- `HUMANEVAL_OVERRIDE_PATH`, which allows overriding the original dataset with a customized dataset

PyPI: https://pypi.org/project/evalplus/0.1.7/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.7/images/sha256-69fe87df89b8c1545ff7e3b20232ac6c4841b43c20f22f4a276ba03f1b0d79ae
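A minimal sketch of the override, assuming the variable is read at dataset-load time (so it must be set beforehand); the path is illustrative:

```python
import os

# Set before evalplus loads the dataset; the path is illustrative.
os.environ["HUMANEVAL_OVERRIDE_PATH"] = "/data/HumanEvalPlus-custom.jsonl"

from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()  # now served from the override file
```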
## v0.1.6

- `--min-time-limit` (default to 0.2s)
- `--gt-time-limit-factor` (default to 4)
- `HumanEval+` dataset bug fixes
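A sketch of loosening the timing limits, assuming both flags belong to `evalplus.evaluate`; the values and the semantics in the comments are illustrative assumptions:

```python
import subprocess

subprocess.run(
    ["python", "-m", "evalplus.evaluate",
     "--dataset", "humaneval", "--samples", "samples.jsonl",
     "--min-time-limit", "0.5",         # presumably the per-test floor, in seconds
     "--gt-time-limit-factor", "8.0"],  # presumably slack over ground-truth runtime
    check=True,
)
```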
PyPI: https://pypi.org/project/evalplus/0.1.6/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.6/images/sha256-5913b95172962ad61e01a5d5cf63b60e1140dd547f5acc40370af892275e777c
## v0.1.5

- `HumanEval+[mini]` -- 47x smaller while equivalently effective as `HumanEval+`
- Pass `--mini` to `evalplus.evaluate ...` to use a minimal and best-quality set of extra tests and accelerate evaluation: `HumanEval+[mini]` (avg. 16.5 tests) is smaller than `HumanEval+` (avg. 774.8 tests) by 47x.

PyPI: https://pypi.org/project/evalplus/0.1.5/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.5/images/sha256-01ef3275ab02776e94edd4a436a3cd33babfaaf7a81e7ae44f895c2794f4c104
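A sketch of opting into the mini set, assuming `--mini` is a boolean flag of `evalplus.evaluate`:

```python
import subprocess

# Evaluate against HumanEval+[mini]'s reduced extra-test set for faster turnaround.
subprocess.run(
    ["python", "-m", "evalplus.evaluate",
     "--dataset", "humaneval", "--samples", "samples.jsonl", "--mini"],
    check=True,
)
```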
## v0.1.4

- Use `ProcessPoolExecutor` over `ThreadPoolExecutor`

PyPI: https://pypi.org/project/evalplus/0.1.4/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.4/images/sha256-a0ea8279c71afa9418808326412b1e5cd11f44b3b59470477ecf4ba999d4b73a
## v0.1.3

- `.jsonl`

PyPI: https://pypi.org/project/evalplus/0.1.3/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.3/images/sha256-fd13ab6ee2aa313eb160fc29debe8c761804cb6af7309280b4e200b6549bd75a
## v0.1.2

- `--base-only` to evaluate against the base tests only
- `pip install` support

PyPI: https://pypi.org/project/evalplus/0.1.2/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.2/images/sha256-747ae02f0bfbd300c0205298113006203d984373e6ab6b8fb3048626f41dbe08
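A sketch combining the two: install from PyPI, then score against the base tests only; the `--base-only` usage below is an assumption modeled on the evaluate invocations above:

```python
import subprocess
import sys

# Install evalplus from PyPI into the current interpreter's environment.
subprocess.run([sys.executable, "-m", "pip", "install", "evalplus"], check=True)

# Report only the "base" scores, skipping the extra "+" tests.
subprocess.run(
    [sys.executable, "-m", "evalplus.evaluate",
     "--dataset", "humaneval", "--samples", "samples.jsonl", "--base-only"],
    check=True,
)
```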
## v0.1.1

In this version, efforts are mainly made to sanitize and standardize the code in `evalplus`. Most importantly, `evalplus` strictly follows the dataset usage style of HumanEval. As a result, users can use `evalplus` in this way:
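A minimal sketch of that HumanEval-style usage, mirroring the MBPP+ workflow above (the `GEN_SOLUTION` stub is a placeholder, as before):

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def GEN_SOLUTION(prompt: str) -> str:
    # The LLM produces the whole solution based on the prompt
    ...

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```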
In more detail, the main changes (tracked in #1) are:
- `.jsonl`
- `get_human_eval_plus()` returns a `dict` instead of a `list`
- `"/"` over `"_"` in task IDs (e.g., `HumanEval/0`)
PyPI: https://pypi.org/project/evalplus/0.1.1/ Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.1/images/sha256-4993a0dc0ec13d6fe88eb39f94dd0a927e1f26864543c8c13e2e8c5d5c347af0