
EvalPlus for rigorous evaluation of LLM-synthesized code


EvalPlus(📖) => 📚

🔥Quick Start • 💻LLM code • 🔨Tools • 📜Citation • 🙏Acknowledgement

About

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

  • ✨ HumanEval+: 80x more tests than the original HumanEval!
  • ✨ MBPP+: 35x more tests than the original MBPP!
  • ✨ Evaluation framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Why EvalPlus?

  • ✨ Precise evaluation & ranking: See our leaderboard for the latest LLM rankings before & after rigorous evaluation.
  • ✨ Coding rigorousness: Look at the score differences, especially before and after applying the EvalPlus tests! A smaller drop is better: it means the generated code is more robust, while a big drop means the generated code tends to be fragile.
  • ✨ Pre-generated samples: EvalPlus accelerates LLM4Code research by open-sourcing LLM-generated samples for various models -- no need to re-run the expensive benchmarks!

Want to know more details? Read our NeurIPS'23 paper as well as our Google Slides!

[!Important]

🚧 MBPP+ update (v0.1.0 to v0.2.0): We recently improved and stabilized the MBPP+ dataset by removing tasks whose test_list is wrong (an issue inherited from the original MBPP dataset), making the tasks more reasonable to solve. MBPP+ v0.1.0 has 399 tasks, while the new v0.2.0 has 378 tasks. We also improved the oracle. Therefore, with v0.2.0 you might expect a ~4pp pass@1 improvement on both base and plus tests.

🔥 Quick Start

[!Tip]

EvalPlus ❤️ bigcode-evaluation-harness! HumanEval+ and MBPP+ have been integrated into bigcode-evaluation-harness, so you can also run the EvalPlus datasets there!

To get started, please first set up the environment:

pip install evalplus --upgrade
⏬ Install nightly version :: click to expand ::
pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
⏬ Using EvalPlus as a local repo? :: click to expand ::
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

Code generation

Implement the GEN_SOLUTION function by calling the LLM to produce the complete solution (including the code) and save the samples to samples.jsonl:

from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
🤔 Structure of `problem`? :: click to expand ::
  • task_id is the identifier string for the task
  • entry_point is the name of the function
  • prompt is the function signature with docstring
  • canonical_solution is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
  • base_input is the test inputs from the original HumanEval
  • plus_input is the test inputs added by EvalPlus
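
As a quick sanity check, you can inspect these fields directly (a small sketch using HumanEval+; the same fields apply to MBPP+):

from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()
task_id, problem = next(iter(problems.items()))
print(task_id, problem["entry_point"])   # task identifier and target function name
print(problem["prompt"])                 # function signature with docstring
print(len(problem["base_input"]), len(problem["plus_input"]))  # original vs. extra test inputs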

[!Note]

Expected Schema of samples.jsonl

  1. task_id: Task ID, i.e., a key of get_[human_eval|mbpp]_plus()
  2. solution (optional): Self-contained solution (usually including the prompt)
    • Example: {"task_id": "HumanEval/?", "solution": "def f():\n return 1"}
  3. completion (optional): Function body without prompt
    • Example: {"task_id": "HumanEval/?", "completion": " return 1"}

Only one of solution and completion is required. If both are provided, solution will be used. We also accept solutions in the form of a directory, i.e., --samples ${SAMPLE_DIR}, where ${SAMPLE_DIR} is organized as ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py (${TASK_ID} = task_id.replace("/", "_")); see the sketch below.
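
If you prefer the directory layout, a minimal sketch for dumping one sample per task could look like the following (the directory name and the trivial stand-in solution are illustrative only):

import os
from evalplus.data import get_human_eval_plus

sample_dir = "my_samples"  # illustrative ${SAMPLE_DIR}
for task_id, problem in get_human_eval_plus().items():
    task_dir = os.path.join(sample_dir, task_id.replace("/", "_"))
    os.makedirs(task_dir, exist_ok=True)
    solution = problem["prompt"] + "    pass\n"  # stand-in for your LLM output
    with open(os.path.join(task_dir, "0.py"), "w") as f:  # {SAMPLE_ID} = 0
        f.write(solution)

You can then pass --samples my_samples to the downstream tools.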

Code post-processing

LLM-generated text may not be compilable code, since it can include natural-language lines or incomplete/extra code. We provide a tool, evalplus.sanitize, to clean up the code:

# 💡 If you are storing code in JSONL:
evalplus.sanitize --samples samples.jsonl
# Sanitized code will be written to `samples-sanitized.jsonl`

# 💡 If you are storing code in directories:
evalplus.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be written to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
🔎 Checking the compilability of post-processed code :: click to expand ::

To double-check the post-processing results, you can use evalplus.syncheck to check code validity before and after sanitization; it will print erroneous code snippets and explain why they are wrong:

# 💡 If you are storing code in JSONL:
evalplus.syncheck --samples samples.jsonl --dataset [humaneval|mbpp]

# 💡 If you are storing code in directories:
evalplus.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]

Code evaluation

We strongly recommend running the evaluation in a sandbox such as Docker:

docker run -v $(pwd):/app ganler/evalplus:latest --dataset [humaneval|mbpp] --samples samples.jsonl

...Or if you want to try it locally regardless of the risks ⚠️:

evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl

[!Tip]

Do you use a very slow machine?

LLM solutions are regarded as failed on timeout (and OOM etc.). Specifically, we set the timeout $T=\max(T_{base}, T_{gt}\times k)$, where:

  • $T_{base}$ is the minimal timeout (configurable via --min-time-limit; defaults to 1s);
  • $T_{gt}$ is the runtime of the ground-truth solutions (measured via profiling);
  • $k$ is a configurable factor, --gt-time-limit-factor (defaults to 4).

If your machine is too slow and you are getting high-variance results, try using a larger $k$ and $T_{base}$.
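
For reference, the timeout rule above can be sketched in a few lines of Python (this mirrors the formula, not the evaluator's actual implementation):

def per_task_timeout(gt_runtime: float,
                     min_time_limit: float = 1.0,        # --min-time-limit (T_base)
                     gt_time_limit_factor: float = 4.0,  # --gt-time-limit-factor (k)
                     ) -> float:
    # T = max(T_base, k * T_gt); raise both knobs on slow machines.
    return max(min_time_limit, gt_time_limit_factor * gt_runtime)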

Additionally, you are NOT encouraged to overstress your test bed while running the evaluation. For example, using --parallel 64 on a 4-core machine, or running other heavy workloads during evaluation, are bad ideas...

🤔 Evaluate with a local GitHub repo? :: click to expand ::
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
⌨️ More command-line flags :: click to expand ::
  • --parallel: defaults to half of the cores
  • --base-only (store_true): only run the base HumanEval tests
  • --i-just-wanna-run: force a re-run

The output should look like the following (a GPT-4 greedy-decoding example):

Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
  • Base is the pass@k for the original HumanEval
  • Base + Extra is the pass@k for our HumanEval+ (with extra tests)
  • The "k" includes [1, 10, 100]; only k values <= the sample size are used
  • A cache file named like samples_eval_results.jsonl will be created. Remove it to re-run the evaluation
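
For reference, the reported numbers are pass@k scores in the style of the standard unbiased estimator introduced with the original HumanEval (Chen et al., 2021); a self-contained sketch of that estimator:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: n = samples per task, c = correct samples.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g., 200 samples for a task with 150 passing: estimate pass@10 for that task
print(pass_at_k(200, 150, 10))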
🤔 How long does it take? :: click to expand ::

If you do greedy decoding, where there is only one sample per task, the evaluation should take just a few seconds. When running 200 samples x 164 tasks x ~700+ tests, it can take around 2-10 minutes with --parallel 64 and --test-details. Here are some tips to speed up the evaluation:

  • Use --parallel $(nproc)
  • Do NOT use --test-details if you just want to quickly get pass@k: --test-details runs all tests (700+ on average per task), whereas without it, testing of a sample stops as soon as it fails its first test.
  • Use our pre-evaluated results (see LLM-generated code)
  • Use HumanEval+ Mini

[!Tip]

🚀 Try out HumanEvalPlus-Mini! It selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add a --mini flag; it can run 23+% faster! (Even faster if you evaluate all tests without fail-stop using --test-details.)

docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
# ...Or locally ⚠️
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini

💻 LLM-generated code

We also share pre-generated code samples from LLMs we have evaluated.

Each sample file is packaged in a zip file named like ${model_name}_temp_${temperature}.zip. You can unzip it to a folder named ${model_name}_temp_${temperature} and run the evaluation from scratch with:

evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}

🔨 Useful tools

To use these tools, please first install the repository from GitHub:

git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r tools/requirements.txt

Code generation

We have configured code generation for a wide range of LLMs (see supported models in codegen/models.py). For example, to run greedy generation with StarCoderBase-7B:

python codegen/generate.py --model starcoderbase-7b --bs 1 --temperature 0 --n_samples 1 --resume --greedy --root [result_path] --dataset [mbpp|humaneval]

Test input generation using EvalPlus

Please check evalplus/inputgen.py.

📜 Citation

@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

πŸ™ Acknowledgement
