OpenCompass Versions

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

0.2.5.rc1

2 weeks ago

0.2.4

1 month ago

The OpenCompass team is thrilled to announce the release of OpenCompass v0.2.4!

🌟 Highlights

  • Enhanced support for multiple datasets including QuALITY, APPS and TACO.
  • Multi-model judging for subjective tests.
  • Bug fixes and improvements in configurations and documentation.

🚀 New Features

🌐 General

  1. Feature #963 - Support for APPS dataset.
  2. Feature #976 - Add an implementation of the QuALITY datasets.
  3. Feature #984 - Add support for setting prediction paths.
  4. Feature #1006 - Support alpacaeval_v2.
  5. Feature #1016 - Add multi-model judge.
  6. Feature #1019 - Add ATC Choice Version.
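
These notes do not describe how multi-model judging (#1016) combines verdicts. As a minimal sketch of one plausible scheme, assuming each judge model returns numeric scores for the same set of answers, the per-judge means can be averaged into a single score. The function name and the judge names are hypothetical and are not OpenCompass APIs.

```python
from statistics import mean


def aggregate_judgements(scores_by_judge):
    """Average each judge's scores, then average across judges.

    Hypothetical helper for illustration; not part of OpenCompass.
    """
    per_judge = {judge: mean(scores) for judge, scores in scores_by_judge.items()}
    overall = mean(list(per_judge.values()))
    return per_judge, overall


# Hypothetical scores from two judge models on the same two answers.
per_judge, overall = aggregate_judgements({
    "judge_a": [8, 6],
    "judge_b": [7, 9],
})
print(per_judge, overall)
```

Averaging per judge first keeps a judge that scored more samples from dominating the result; other schemes such as majority voting are equally reasonable.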

📖 Documentation

  1. Docs #1015 - General documentation updates and improvements.

🐛 Bug Fixes

  1. Fix #964 - Fix the config name of deepseek-coder.
  2. Fix #890 - Update links and link checkers.
  3. Fix #977 - Fix a bug in internlm2 series configs.
  4. Fix #975 - Fix documentation issues.
  5. Fix #992 - Fix running issues in turbomind_tis.
  6. Fix #994 - Change status to list in base.py.
  7. Fix #995, Fix #1020 - Quick fixes and refactors for configs.

⚙ Enhancements and Refactors

  1. Modify requirements/runtime.txt #983 - Update numpy version requirement.
  2. Update Needlebench and configs #986 - Enhancements in Needlebench configurations.
  3. Simplify needlebench summarizer #1024 - Streamline Needlebench summarizer for better efficiency.
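
Per the change log for this release, #983 relaxes the requirement from numpy==1.23.4 to numpy>=1.23.4. A minimal sketch of the practical difference follows. The helper `satisfies` is hypothetical, handles only plain dotted numeric versions, and is not how pip resolves requirements.

```python
def parse_version(text):
    """Turn a plain dotted version such as '1.23.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, specifier):
    """Check an installed version against an '==X' or '>=X' specifier.

    Hypothetical helper for illustration only; real tools implement the
    full version-specifier rules.
    """
    for op in ("==", ">="):
        if specifier.startswith(op):
            required = parse_version(specifier[len(op):])
            have = parse_version(installed)
            return have == required if op == "==" else have >= required
    raise ValueError(f"unsupported specifier: {specifier}")


# A newer numpy fails the old exact pin but passes the relaxed lower bound.
print(satisfies("1.26.0", "==1.23.4"))
print(satisfies("1.26.0", ">=1.23.4"))
```

With the relaxed bound, environments that already ship a newer numpy no longer conflict with the runtime requirements.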

🎉 Welcome New Contributors

  • @seanzhang-zhichen, @kleinzcy, @ispobock, @Chaseldot, and @Y0oMu made their first contributions. Welcome to the OpenCompass community!

🔗 Full Change Logs

  • [Fix] fix the config's name of deepseek-coder by @jingmingzhuo in https://github.com/open-compass/opencompass/pull/964
  • [Fix] Update links and link checkers by @Leymore in https://github.com/open-compass/opencompass/pull/890
  • [Feat] support apps by @Connor-Shen in https://github.com/open-compass/opencompass/pull/963
  • fix doc problem by @seanzhang-zhichen in https://github.com/open-compass/opencompass/pull/975
  • [Fix] fix a bug in internlm2 series configs by @jingmingzhuo in https://github.com/open-compass/opencompass/pull/977
  • [Feature] Add the implement of QuALITY datasets by @jingmingzhuo in https://github.com/open-compass/opencompass/pull/976
  • modify the requirements/runtime.txt: numpy==1.23.4 --> numpy>=1.23.4 by @kleinzcy in https://github.com/open-compass/opencompass/pull/983
  • [Feature] add support for set prediction path by @bittersweet1999 in https://github.com/open-compass/opencompass/pull/984
  • [Feat] Support TACO by @Connor-Shen in https://github.com/open-compass/opencompass/pull/966
  • [Feature] update apps by @Connor-Shen in https://github.com/open-compass/opencompass/pull/985
  • [Fix] update apps/taco by @Connor-Shen in https://github.com/open-compass/opencompass/pull/988
  • [Feature] add one script for subjective by @bittersweet1999 in https://github.com/open-compass/opencompass/pull/993
  • Fix running issues in turbomind_tis by @ispobock in https://github.com/open-compass/opencompass/pull/992
  • [Fix] base.py change status into list by @Chaseldot in https://github.com/open-compass/opencompass/pull/994
  • [Fix] quick fix for configs by @bittersweet1999 in https://github.com/open-compass/opencompass/pull/995
  • [Feature] update needlebench and configs by @DseidLi in https://github.com/open-compass/opencompass/pull/986
  • [Feature] support alpacaeval_v2 by @bittersweet1999 in https://github.com/open-compass/opencompass/pull/1006
  • updates docs by @Y0oMu in https://github.com/open-compass/opencompass/pull/1015
  • [Feature] Add multi-model judge and fix some problems by @bittersweet1999 in https://github.com/open-compass/opencompass/pull/1016
  • [Fix] Refactor Needlebench Configs for CLI Testing Support by @DseidLi in https://github.com/open-compass/opencompass/pull/1020
  • [Feature] Add ATC Choice Version by @DseidLi in https://github.com/open-compass/opencompass/pull/1019
  • [Fix] Simplify needlebench summarizer by @DseidLi in https://github.com/open-compass/opencompass/pull/1024

For a detailed overview of all changes, check out our Full Changelog.

0.2.4.rc1

1 month ago

This release provides more parsed datasets:

OpenCompassData-complete-20240325.zip

Important updates compared to the previous version are as follows:

  • Subjective: Add MTBench
  • LongText: Support Needle-In-Haystack Test Dataset
  • Code: Update generation version of CIBench

0.2.3

1 month ago

The OpenCompass team is thrilled to announce the release of OpenCompass v0.2.3! This version is packed with new features, crucial fixes, and documentation updates to improve your experience. We're continuously working to enhance OpenCompass, making it more robust and versatile for all users.

🌟 Highlights:

  • Enhanced Model Support: Introduction of new models and configurations, including support for the LightllmApi, lmdeploy pytorch engine, and more.
  • New Datasets and Benchmarks: Expanding our dataset repository with additions like OpenFinData, lveval benchmark, and an upgrade to Needlebench.
  • Documentation and Sync Improvements: Updated dataset pack URLs, fixed documentation errors, and synchronized with internal code for consistency.

Explore the key updates in this release:

🌟 New Features:

  • 📦 Dataset and Benchmark Expansion:

    • Support for new datasets like OpenFinData and an upgrade to Needlebench, offering broader evaluation capabilities (#896, #913).
    • Introduction of the lveval benchmark to enrich the evaluation landscape (#914).
  • 🛠 Model and API Integrations:

    • Enhanced functionality with support for LightllmApi input_format and prompt templates, alongside the introduction of get_ppl for TurbomindModel (#888, #878).
    • New model configurations added, including support for gemini and deepseek-coder, further broadening the tools available for users (#931, #943).
  • 📖 Documentation and Sync Updates:

    • Updated the dataset pack URLs and the rank link in the README to ensure users have access to the latest resources (#922, #911).
    • Several syncs with internal code and a GitHub blacklist update to maintain consistency and integrity (#929, #953).
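
The notes do not state what get_ppl returns for TurbomindModel. As general background, perplexity-based evaluation scores each candidate answer by how unsurprising the model finds it, and perplexity is conventionally the exponential of the mean negative log-likelihood per token. A minimal sketch under that assumption follows; the function and the log-probabilities are hypothetical and this is not the OpenCompass implementation.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean of the per-token natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities a model assigned to two candidate answers.
candidates = {
    "A": [math.log(0.5), math.log(0.5)],
    "B": [math.log(0.1), math.log(0.2)],
}
# PPL-style selection: pick the candidate with the lowest perplexity.
best = min(candidates, key=lambda name: perplexity(candidates[name]))
print(best)
```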

🐛 Bug Fixes:

  • Addressed various configuration and template issues to ensure smoother operation across different models and benchmarks (#894, #893).
  • Fixed issues related to IFEval, including type hints and config bugs, enhancing evaluation accuracy and functionality (#906, #915).

🎉 Welcome New Contributors:

  • We're delighted to welcome our new contributors: @xu-song, @x22x22, @yuantao2108, and @fanqiNO1. Your contributions are invaluable to the growth of OpenCompass!

🔗 Full Changelog

For a detailed overview of all changes, check out our Full Changelog.

0.2.2

3 months ago

Welcome to OpenCompass v0.2.2, a release brimming with new features, essential fixes, and significant improvements across the board. With a focus on enhancing functionality and expanding dataset support, this update underscores our commitment to providing a robust platform for our users.

🌟 Highlights:

  • Broadened Dataset Support: Introduction of diverse datasets like T-Eval, CIBench, IFEval, NPHardEval, and more, broadening the horizons for research and evaluation.
  • API Integrations and Updates: New support for APIs like Nanbeige and updates to existing ones such as Zhipu and Sensetime, enhancing model interaction capabilities.
  • Dataset Collection Release: An integrated dataset collection is available in 0.2.2.rc1. Datasets used in the OpenCompass 2.0 leaderboard are NOT included in this collection.

Dive into what's new and improved:

🌟 New Features:

  • 📦 Datasets Expansion:

    • Addition of multiple new datasets and evaluations, including T-Eval, CIBench, IFEval, and NPHardEval, offering more versatility for users (#813, #829, #809, #835).
  • 🛠 API and Model Enhancements:

    • Support for new APIs like Nanbeige and updates to enhance the functionality of existing ones (#786, #847, #834).
    • Configurations and support for models and evaluators have been improved and expanded (#791, #812, #845).
  • 📖 Documentation and CI Enhancements:

    • Updated FAQs, contribution guides, and added new test runners to improve CI/CD processes (#830, #751, #874).

🐛 Bug Fixes:

  • Various fixes have been applied to address issues across datasets, evaluators, and configurations, ensuring a smoother experience for all users (#787, #788, #789).

🎉 Welcome New Contributors:

  • We're excited to welcome our new contributors: @notoschord, @zhulinJulia24, @QipengGuo, @RangiLyu, @del-zhenwu, and @hailsham. Thank you for your valuable contributions!

🔗 Full Changelog

For a full list of updates, visit our Full Changelog.

Thank you to every contributor, old and new. Your dedication is shaping OpenCompass into a more robust and versatile tool. 🙌 🎉

Remember to star 🌟 our GitHub repository if OpenCompass aids your research and development! Your support and feedback are crucial for our continuous improvement.

0.2.2.rc1

3 months ago

This release provides more parsed datasets:

OpenCompassData-core-20240207.zip OpenCompassData-complete-20240207.zip

Important updates compared to the previous version are as follows:

  • Subjective: Add AlignBench, MTBench
  • Agent: Add T-Eval
  • Medicine: Add MedBench
  • Code: Add HumanEval-X, DS-1000
  • Finance: Add FinanceIQ
  • Law: Update LawBench Evaluation Assets

OpenCompassData-core-20240207.zip

AGIEval ARC BBH ceval CLUE cmmlu
commonsenseqa drop FewCLUE flores_first100 GAOKAO-BENCH gsm8k
hellaswag humaneval lambada LCSTS math mbpp
mmlu nq openbookqa piqa race siqa
strategyqa summedits SuperGLUE TheoremQA triviaqa tydiqa
winogrande xstory_cloze Xsum

OpenCompassData-complete-20240207.zip

AGIEval anli ARC BBH CDME ceval
cibench_dataset cleva clozeTest-maxmin CLUE CMB cmmlu
commonsenseqa commonsenseqa_cn crowspairs_cn drop ds1000_data FewCLUE
FinanceIQ flores200_dataset flores_first100 FunctionalMT game24 GAOKAO-BENCH
gpqa gsm8k hellaswag humaneval humaneval_cn humaneval_multipl-e
humanevalx HungarianExamMath InfiniteBench lambada lanQ lawbench
LCSTS math math401 mbpp mbpp_cn mbpp_plus
MedBench mmlu MNIST NPHardEval nq nq_cn
nq-open openbookqa piqa py150 qabench race
scibench siqa SQuAD2.0 strategyqa alignment_bench mtbench
summedits SuperGLUE svamp teval TheoremQA triviaqa
tydiqa winogrande xiezhi xlsum xstory_cloze Xsum

0.2.1

4 months ago

We're thrilled to announce OpenCompass v0.2.1, loaded with new datasets, features, and vital fixes. This release is a testament to our ongoing commitment to enhancing user experience and broadening research capabilities.

🌟 Highlights:

  • Add Agent and Code datasets: Diverse new datasets like GPQA, mastermath2024v1, and more, significantly expanding the scope of OpenCompass.
  • Support for different JudgeLLMs in subjective evaluation: More choice when selecting a judge LLM.
  • Support Needle in Haystack: Needle in Haystack testing for long-text evaluation.
  • Add vLLM Evaluation: vLLM inference and evaluation are now supported.
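
The notes do not describe how the Needle in Haystack test is built. As it is commonly described, a short fact (the needle) is placed at a chosen depth inside long filler text, and the model is asked to retrieve it. A minimal sketch under that assumption follows; the function names, filler text, and substring scoring rule are hypothetical and are not the OpenCompass implementation.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def found_needle(model_answer, expected):
    """Score a retrieval attempt by case-insensitive substring match."""
    return expected.lower() in model_answer.lower()


# Ten filler sentences with the needle placed halfway through.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 4217."
prompt = build_haystack(filler, needle, depth=0.5)
print(prompt)
```

Sweeping `depth` and the amount of filler gives a grid of retrieval results across context lengths and needle positions.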

Here's what's new:

🚀 New Features:

  • 📦 Dataset Expansion:

    • Added rwkv-5-3b model (#666)
    • Integration of diverse datasets including GPQA, Creationbench, and more.
    • Support for new datasets like mastermath2024v1, mbpp_plus, and sanitized_mbpp (#744, #770, #745)
  • 🛠 Functional Enhancements:

    • Subjective evaluation improvements (#692, #724)
    • Updated python action, slurm, and docker docs (#694, #718)
    • Turbomind API support and Qwen API integration (#693, #735)
  • 📖 Documentation Updates:

    • Updated contamination, alignmentbench, and other docs for better clarity (#698, #707)
    • Fixed dead links and typos in various documents (#455, #773, #774)

🐛 Bug Fixes:

  • Addressed various issues including those in alignmentbench, configs, and postprocess scripts.
  • Fixed bugs concerning subjective evaluation and EOS string detection.
  • Quick fixes for improved performance and reliability.

🎉 Welcome New Contributors:

  • A warm welcome to our first-time contributors:
    • @BBuf, @DseidLi, @Skyfall-xzz, @RunningLeon, @zehuichen123, @AllentDan, @Connor-Shen, @Francis-llgg, @hzhwcmhf, @ChrisLiu6, @yanyc428, @tpoisonooo, @jiangjin1999

🔗 Full Changelog

For a full list of updates, visit our Full Changelog.

Thank you to every contributor, old and new. Your dedication is shaping OpenCompass into a more robust and versatile tool. 🙌 🎉

Remember to star 🌟 our GitHub repository if OpenCompass aids your research and development! Your support and feedback are crucial for our continuous improvement.

0.2.0

4 months ago

🌟 Highlights

  • 🛠 Data Contamination Analysis: A novel feature for analyzing and ensuring the integrity of dataset inputs.
  • 🧠 Enhanced Subjective Evaluation: Implementation of a new subjective judgement system, providing more nuanced and accurate evaluations.
  • 🚀 Chat Style Inferencer Support: Introduction of a new chat style inferencer, enhancing interactive capabilities.
  • 🌐 Multilingual Features: Expansion to support Chinese versions of commonsenseqa, crowspairs, and nq datasets.
  • 📊 New Datasets Integration: Addition of wikibench, rolebench, and updated versions of gsm8k and MathBench datasets for broader research applications.
  • 🛠 Enhancements and Bug Fixes: Numerous improvements including a new subjective judgement system and updates in MathBench CodeInterpreter.
  • 📝 Documentation and API Updates: Comprehensive updates to README and API interfaces for better user guidance and experience.

🚀 New Features & Enhancements

  • Support for chat style inferencer, offering a more dynamic interaction model (#643).
  • Addition of Chinese versions for key datasets: commonsenseqa, crowspairs, and nq (#144).
  • Introduction of the wikibench dataset, providing a new benchmark for knowledge-based tasks (#655).
  • Updated gsm8k and MathBench configurations for enhanced performance and accuracy (#652, #657).
  • Addition of rolebench dataset, expanding the range of evaluative scenarios (#633).
  • Implementation of new subjective judgement criteria for improved assessment accuracy (#660).
  • Integration of advanced models like qwen-1.8b/72b and deepseek-7b/67b in the platform's configuration (#672).
  • Launch of Data Contamination Analysis as a new feature, enhancing data integrity checks (#639).

🛠 Improvements & Fixes

  • Removal of colossalai dependency to streamline operations (#645).
  • Resolution of various bugs including hellaswag_ppl_47bff9 and standard deviation summarizer issues (#648, #675).
  • Update and fix of the MathBench CodeInterpreter and related bugs (#657).
  • Enhancement of API interface for improved functionality and user experience (#681).

📚 Documentation Updates

  • Updated README for clearer guidance and information (#682).
  • Documentation and docstring updates for accuracy and comprehensiveness (#684).

🎊 New Contributors

  • A warm welcome to new contributors @rolellm, @liyucheng09, and @xmshi-trio. Your contributions have significantly enriched OpenCompass!

🔗 Full Changelog

Thank you to all contributors for your hard work and dedication. OpenCompass v0.2.0 marks another step forward in our journey, bringing enhanced features and capabilities to the community. Let's continue to innovate and expand the horizons of OpenCompass! 🎉🌐💡

0.1.9

5 months ago

🌟 Highlights

  • 🚀 New API Integrations: A leap forward with the addition of multiple new APIs, including Baidu, Moonshot, Sensetime, and more, broadening the scope and capabilities of OpenCompass.
  • 🔵 Circular Evaluation Feature: Introducing Circular Eval, an enhancement for comprehensive and dynamic evaluations within the platform.
  • 🤖 Turbomind Inference Integration: Integration of Turbomind inference through its RPC API, enhancing the platform's inferencing capabilities.

🚀 New Features & Enhancements

  • Model & API Development: Explore new capabilities with DataCanvas Alaya LM, Lightllm API, 360API, and enhanced Turbomind Python API integration (#612, #613, #601, #484).
  • Circular Evaluation Implementation: Elevate your evaluation methods with the newly added Circular Eval feature, offering a more nuanced and detailed analysis capability (#610).
  • Rich Dataset Additions: Enrich your research with new datasets - FinanceIQ, SVAMP, GSM_Hard, and updated Mathbench for diverse applications (#596, #604, #619, #580, #607).
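
The notes do not spell out how Circular Eval works. As it is commonly described for multiple-choice benchmarks, each question is asked once per cyclic shift of its options, and the model is credited only if it answers correctly every time, which penalizes position bias. A minimal sketch under that assumption follows; the function names and the toy models are hypothetical and are not OpenCompass APIs.

```python
def rotations(options):
    """Yield every cyclic shift of the option list."""
    for shift in range(len(options)):
        yield options[shift:] + options[:shift]


def circular_correct(ask_model, question, options, answer):
    """Return True only if the model picks `answer` under every rotation.

    `ask_model(question, options)` is a hypothetical callable that returns
    the text of the option the model chose.
    """
    return all(ask_model(question, opts) == answer for opts in rotations(options))


# Toy stand-ins for a model: one reads the options, one always picks slot A.
def careful_model(question, options):
    return "Paris" if "Paris" in options else options[0]


def position_biased_model(question, options):
    return options[0]


question = "What is the capital of France?"
options = ["Paris", "Rome", "Madrid", "Berlin"]
print(circular_correct(careful_model, question, options, "Paris"))
print(circular_correct(position_biased_model, question, options, "Paris"))
```

The position-biased model is right on the original ordering but fails as soon as the options rotate, so it receives no credit under this rule.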

🛠 Improvements & Fixes

  • Subjective Evaluation Bug Fixes: Improved accuracy in subjective evaluations (#589).
  • Dataset and Feature Fixes: Resolving issues in CMB dataset, various feature enhancements, and fixes (#587, #592, #615, #632).

📚 Documentation Updates

  • README & FAQ Enhancements: Updated for better clarity and assistance (#582, #622, #628, #629).
  • Typo and Spelling Corrections: Ensuring accuracy and professionalism in documentation (#594, #637).

🎊 New Contributors

Welcoming new contributors to the OpenCompass family!

  • @rahidzeynal, @Sniper970119, @ZhangRaymond, @HunterKruger, @helloyongyang, and @Yggdrasill7D6. Your contributions are greatly appreciated!

What's Changed

Explore the detailed changes in the full changelog.

Thank you to all the contributors for this release. Your dedication and hard work continue to enhance OpenCompass, making it an ever-evolving and dynamic tool for the community. Let's dive into the new possibilities with OpenCompass v0.1.9! 🎉🧮💻

0.1.8

5 months ago

🔥 Highlights

  • 🌐 New Dataset Integrations: Expanding our dataset collection with Tabmwp, py150, maxmin, and more.
  • 💡 Compatibility and API Support: Enhancements with MiniGPT-4 and MiniMax API, and support for Xunfei API.
  • 🛠️ Local Environment and Debugging Improvements: Streamlined local debugging and usage of datasets from local paths.

🚀 New Features & Enhancements

  • Datasets Galore: Unleash the power of new datasets including Tabmwp, py150, maxmin, and updates to existing ones like Mathbench for broader research scope (#505, #546, #562).
  • MiniGPT-4 & MiniMax API Compatibility: Stay up-to-date with the latest versions and extended API support (#539, #548).
  • Xunfei API Model & Update: Explore new possibilities with the integration and update of Xunfei API (#547, #572).

🛠 Improvements & Fixes

  • Local Debug Mode Restriction: Enhanced resource management in local debug mode (#522 by @yingfhu).
  • Various Fixes and Updates: Addressing typos, import issues, and log redirections for smoother operation (#520, #549, #551, #555, #564).

📚 Documentation Updates

  • Enhanced README and FAQs: Get all your queries answered and understand OpenCompass better with updated documentation (#523, #531, #535, #540, #567).
  • Typo Corrections: Ensuring clarity and accuracy in our documentation (#530, #533).

🎊 New Contributors

A warm welcome to the new members of the OpenCompass community!

  • @Sanster, @ayushrakesh, @HimanshuMahto, @shresthasurav, @bittersweet1999, and @jingmingzhuo. Thank you for your valuable contributions!

Changelog

Explore the detailed changes in the full changelog.

Thank you to everyone who contributed to this release. Your efforts are immensely appreciated and are helping to make OpenCompass a more robust and versatile tool. Let's continue to push the boundaries with OpenCompass v0.1.8! 🚀🌐🛠️