OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA-2, Qwen, GLM, Claude, etc.) across 100+ datasets.
We also provide archives of pre-parsed datasets:
OpenCompassData-core.zip
AGIEval | ARC | BBH | ceval | CLUE | cmmlu |
---|---|---|---|---|---|
commonsenseqa | drop | FewCLUE | flores_first100 | GAOKAO-BENCH | gsm8k |
hellaswag | humaneval | lambada | LCSTS | math | mbpp |
mmlu | nq | openbookqa | piqa | race | siqa |
strategyqa | summedits | SuperGLUE | TheoremQA | triviaqa | tydiqa |
winogrande | xstory_cloze | Xsum |
OpenCompassData-complete.zip
AGIEval | anli | ARC | BBH | ceval | cleva |
---|---|---|---|---|---|
CLUE | CMB | cmmlu | commonsenseqa | drop | ds1000 |
FewCLUE | flores200_dataset | flores_first100 | game24 | GAOKAO-BENCH | govrep |
gsm8k | hellaswag | humaneval | jigsawmultilingual | lambada | lawbench |
LCSTS | math | mbpp | mmlu | narrativeqa | nq |
openbookqa | piqa | QASPER | race | realtoxicprompts | scibench |
siqa | SQuAD2.0 | strategyqa | summedits | SummScreen | SuperGLUE |
TheoremQA | triviaqa | triviaqa-rc | tydiqa | winogrande | xiezhi |
xlsum | xstory_cloze | Xsum | FinanceIQ |
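The archives above can be unpacked with Python's standard library. A minimal sketch (the `extract_datasets` helper and the `data/` target directory are illustrative assumptions, not part of OpenCompass itself):

```python
import zipfile
from pathlib import Path

def extract_datasets(archive: str, target: str = "data") -> list[str]:
    """Unpack a downloaded OpenCompassData-*.zip into `target` and
    return the top-level dataset directories it contains."""
    out = Path(target)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        zf.extractall(out)
    # Top-level entry names, e.g. "gsm8k", "mmlu", ...
    return sorted({n.split("/", 1)[0] for n in names})
```

For example, `extract_datasets("OpenCompassData-core.zip")` would place each dataset directory listed in the table above under `data/`.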
- Use `do_sample=False` in HF model for precise control over sampling behavior (#506 by @Leymore, https://github.com/open-compass/opencompass/pull/506).

We're delighted to welcome new contributors to the OpenCompass community!
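The `do_sample=False` fix pins HF generation to greedy decoding. Conceptually (a minimal sketch, not OpenCompass's actual code; the `pick_token` helper is hypothetical), greedy decoding takes the argmax at every step, which makes evaluation scores reproducible across runs:

```python
import math
import random

def pick_token(logits, do_sample=False, rng=None):
    """Choose the next token id from raw logits.

    do_sample=False -> greedy argmax: deterministic, reproducible results.
    do_sample=True  -> sample from the softmax distribution: run-to-run noise.
    (Illustrative only; real HF models take do_sample via generate().)
    """
    if not do_sample:
        # Greedy decoding: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [0.1, 2.5, -1.0, 2.4]
assert pick_token(logits) == pick_token(logits) == 1  # greedy is stable
```

With sampling enabled, two evaluation runs of the same model can pick different tokens and hence score differently, which is why benchmarks default to greedy decoding.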
The full list of changes is available in the changelog. A massive thank you to all the community members who contributed to this release. Your efforts are propelling OpenCompass further! 🙏
Embark on new explorations with OpenCompass v0.1.7!
Welcome to the newest version of OpenCompass! v0.1.6 brings forth exciting dataset additions, crucial fixes, and enhanced documentation. We're confident that this release will provide a better and smoother experience for all users.
Dive into the details:
📦 Datasets Galore:
🛠 Utilities and Enhancements:
📚 Documentation and Syncs:
- Fixed `clp` errors and added support for bs>1 (#439)
- `jieba` rouge (#459, #467)

Huge thanks to all contributors! Your constant efforts make OpenCompass better with each release. 🙏 🎉
For a detailed overview, check out our Full Changelog.
If you find OpenCompass beneficial, kindly star 🌟 our GitHub repository! We value your feedback, reviews, and continued support.
Dive into our newly improved features, bug fixes, and, most notably, enhanced dataset support, all of which come together to refine your experience.
New datasets include `ds1000`, `promptbench`, `anthropics evals`, `kaoshi`, and many more, making OpenCompass more versatile than ever.

Explore the detailed changes:
- Updated `get_started.md` and fixed naming issues (#377, #380)
- `longeval` (#389)
- `has_image` fix for scienceqa (#391)
- `visualglm` (#424)
- `mplug_owl` and `llama_adapter` introduced (#405)

A heartfelt welcome to our first-time contributors:
@wangxidong06 (First PR) @so2liu (First PR) @HoBeedzc (First PR) @CuteyThyme (First PR) @chenbohua3 (First PR)
To all contributors, old and new, thank you for continually enhancing OpenCompass! Your efforts are deeply valued. 🙏 🎉
If you love OpenCompass, don't forget to star 🌟 our GitHub repository! Your feedback, reviews, and contributions immensely help in shaping the product.
Full Changelog: https://github.com/open-compass/opencompass/compare/0.1.4...0.1.5
OpenCompass v0.1.4 is here with an array of features, documentation improvements, and key fixes! Dive in to see what's in store:
- More Tools and Features: OpenCompass continues to expand its repertoire with the addition of tools like update suffix, codellama, preds collection tools, qwen & qwen-chat support, and more. Not forgetting our attention to Otter and the MMBench Evaluation!
- Documentation Facelift: We've made several updates to our documentation, ensuring it stays relevant, user-friendly, and aesthetically pleasing.
- Essential Bug Fixes: We've tackled numerous bugs, especially those concerning tokens, triviaqa, nq postprocess, and qwen config.
- Enhancements: From simplifying execution logic to suppressing warnings, we're always on the lookout for ways to improve our product.
Dive deeper to learn more:
📦 Tools and Integrations:
🛠 Utilities and Functionality:
Thank you to all our contributors for this release, with a special shoutout to our new contributors:
@Luodian (First PR) @ZhangYuanhan-AI (First PR) @HAOCHENYE (First PR)
Thank you to the entire community for pushing OpenCompass forward. Make sure to star 🌟 our GitHub repository if OpenCompass aids your endeavors! We treasure your feedback and contributions.
For an exhaustive list of changes, kindly check our Full Changelog.
OpenCompass keeps getting better! v0.1.3 brings a variety of enhancements, new features, and crucial fixes. Here's a summary of what we've packed into this release:
Extended Dataset Support: OpenCompass now integrates a broader range of public datasets, including but not limited to `adv_glue`, `codegeex2`, `Humanevalx`, `SEED-Bench`, `LongBench`, and `LEval`. We aim to provide extensive coverage to cater to a variety of research needs.
Utility Additions: From the inclusion of multi-modal evaluations on MME benchmark to the Tree-of-Thought method, this release comes packed with functionality enhancements.
Bug Extermination: Your feedback helps us grow. We've squashed a series of bugs to improve your experience.
More Evaluation Benchmark for Multimodal Models. We support another 10 evaluation benchmarks for multimodal models, including COCO Caption and ScienceQA, and provide corresponding evaluation code.
Let's delve deeper into what's new:
📦 Extended Dataset Support:

- `adv_glue` dataset focused on adversarial robustness (#205)
- `codegeex2`, `Humanevalx` (#210)
- `LEval` dataset (#266)

🛠 Utilities and Functionality:

- `llama2` native implementations (#235)

Thank you to all our contributors for this release, with a special shoutout to our new contributors:
@yyk-wew (First PR) @fangyixiao18 (First PR) @philipwangOvO (First PR) @cdpath (First PR)
Thank you to our dedicated contributors for making OpenCompass even more comprehensive and user-friendly! 🙏 🎉
Remember to star 🌟 our GitHub repository if you find OpenCompass helpful! Your feedback and contributions are invaluable.
For a complete list of changes, please refer to our Full Changelog.
This release continues the evolution of OpenCompass, bringing a mix of new features, optimizations, documentation improvements, and bug fixes.
🏆 Leaderboard: The evaluation results of Qwen-7B, XVERSE-13B, LLaMA-2, and GPT-4 have been posted to our leaderboard. Now it's also possible to conduct model comparison online. We hope this feature offers deeper insights!
📚 Datasets: Introduction of Xiezhi, SQuAD2.0, ANLI, LEval datasets, and more for diverse applications. (#101, #192) Added safety-related datasets to collections. [#185]
🎭 New modality: Support for MMBench is introduced, and the evaluation of multi-modal models is on the way! (#56, #161) Besides, the Intern language model is introduced. (#51)
⚙️ Enhancement: Several enhancements to OpenAI models, including key deprecation, temperature setting, etc. [#121] [#128] Support for multiple tasks on one GPU, filtering messages by levels, and more. [#148] [#187]
📖 Documentation: Comprehensive updates and fixes across READMEs, issue templates, prompt docs, metric documentation, and more.
🛠️ Bug Fixes: Including seed fixes in HFEvaluator, addressing issues in AGIEval multiple-choice questions, and more. [#122] [#137]
Thank you to all our contributors for this release, with a special shoutout to our new contributors:
@go-with-me000 (First Contribution) @anakin-skywalker-Joseph (First Contribution) @zhouzaida (First Contribution) @dependabot (First Contribution)
Full Changelog: https://github.com/InternLM/opencompass/compare/0.1.1...0.1.2
Added some more datasets.
First release with some datasets.