OpenCompass Versions

OpenCompass is an LLM evaluation platform that supports a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.

0.1.8.rc1

6 months ago

Provides more parsed datasets:

OpenCompassData-core.zip

AGIEval ARC BBH ceval CLUE cmmlu
commonsenseqa drop FewCLUE flores_first100 GAOKAO-BENCH gsm8k
hellaswag humaneval lambada LCSTS math mbpp
mmlu nq openbookqa piqa race siqa
strategyqa summedits SuperGLUE TheoremQA triviaqa tydiqa
winogrande xstory_cloze Xsum

OpenCompassData-complete.zip

AGIEval anli ARC BBH ceval cleva
CLUE CMB cmmlu commonsenseqa drop ds1000
FewCLUE flores200_dataset flores_first100 game24 GAOKAO-BENCH govrep
gsm8k hellaswag humaneval jigsawmultilingual lambada lawbench
LCSTS math mbpp mmlu narrativeqa nq
openbookqa piqa QASPER race realtoxicprompts scibench
siqa SQuAD2.0 strategyqa summedits SummScreen SuperGLUE
TheoremQA triviaqa triviaqa-rc tydiqa winogrande xiezhi
xlsum xstory_cloze Xsum FinanceIQ
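
To check which datasets a downloaded archive actually contains, one option is to list its top-level entries. The following is a minimal sketch using only the Python standard library. The helper name `top_level_entries` is hypothetical, and the internal folder layout of the OpenCompass archives is an assumption: if the datasets sit under a wrapper folder, this returns only that folder's name.

```python
import zipfile
from pathlib import PurePosixPath


def top_level_entries(zip_path):
    """Return the sorted top-level names stored in a zip archive.

    Hypothetical helper for illustration; it is not part of OpenCompass.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Zip member names always use "/" as the separator, so the first
        # path component is the top-level file or folder.
        names = {
            PurePosixPath(name).parts[0]
            for name in archive.namelist()
            if name
        }
    return sorted(names)
```

For example, `top_level_entries("OpenCompassData-core.zip")` could be compared against the dataset names listed above once the file has been downloaded.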

0.1.7

6 months ago

🌟 Highlights

  • Sampling Control: Enforce do_sample=False for precise control over sampling behavior in HF model.
  • Subjective Evaluation Guidance: Enhanced evaluation mechanisms for a more comprehensive understanding and analysis of models.
  • Eval Details Dump: Now, evaluation details for certain datasets are available for a deeper insight and analysis.
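
In Hugging Face generation, `do_sample=False` means tokens are chosen deterministically (greedy decoding, or beam search when several beams are used) rather than drawn at random. The toy sketch below illustrates the difference for a single decoding step. The function `pick_next_token` and the probability table are hypothetical and are not OpenCompass or Transformers code.

```python
import random


def pick_next_token(probs, do_sample=False, rng=None):
    """Choose one token from a {token: probability} mapping.

    do_sample=False: greedy choice, always the most probable token.
    do_sample=True:  random draw weighted by the probabilities.
    """
    tokens = list(probs)
    if not do_sample:
        return max(tokens, key=lambda t: probs[t])
    rng = rng or random.Random()
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up distribution over the next token.
probs = {"Paris": 0.6, "Lyon": 0.3, "Nice": 0.1}

# Greedy decoding returns the same token on every call.
greedy = [pick_next_token(probs) for _ in range(5)]

# Sampling may return any token with non-zero probability.
sampled = [pick_next_token(probs, do_sample=True) for _ in range(5)]
```

Greedy selection makes evaluation runs repeatable, which is the usual motivation for disabling sampling in a benchmark setting.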

🚀 New Features

  • Eval Details Dump for deeper insight into each test case (#517 by @Leymore).
  • MathBench Dataset and Circular Evaluator to bolster mathematical benchmarking capabilities (#408 by @liushz).
  • Support for Math/GSM8k Agent Config providing new avenues for configuration (#494 by @yingfhu).
  • Default Example Summarizer making summarization tasks more accessible (#508 by @Leymore).
  • Model Keyword Arguments Setting for HF Model enhancing customization (#507 by @Leymore).
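
As a rough illustration of circular evaluation, the sketch below assumes the common formulation: a multiple-choice question is asked once per rotation of its options, and it counts as solved only if the model picks the right option every time. This is an assumption about the general technique, not the code from #408. The names `circular_variants`, `circular_correct`, and the `predict` callback are hypothetical.

```python
def circular_variants(options, answer_index):
    """Yield (rotated_options, rotated_answer_index) for every rotation."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # The correct option moves `shift` places to the left, wrapping around.
        yield rotated, (answer_index - shift) % n


def circular_correct(predict, question, options, answer_index):
    """Return True only if `predict` is right for every rotation.

    `predict(question, options)` must return the index of the chosen option.
    """
    return all(
        predict(question, rotated) == idx
        for rotated, idx in circular_variants(options, answer_index)
    )


# A predictor that reads the option text passes every rotation.
def content_aware(question, options):
    return options.index("4")


# A predictor that always picks the first option fails once options rotate.
def position_biased(question, options):
    return 0


options = ["3", "4", "5", "6"]
passes = circular_correct(content_aware, "2 + 2 = ?", options, 1)
fails = circular_correct(position_biased, "2 + 2 = ?", ["4", "3", "5", "6"], 0)
```

Requiring all rotations to be correct penalizes models that favor a fixed answer position instead of the answer content.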

🛠 Improvements & Refactorings

  • Local API Speed Up with fixed concurrent users for better performance (#497 by @yingfhu).
  • Local Runner Support for Windows expanding the platform support (#515 by @yingfhu).
  • Sync with Internal Implements for updated and refined functionalities (#488 by @Leymore).

🐛 Bug Fixes

  • Summary Default Fix for accurate summarization (#483 by @Leymore).
  • Enforce do_sample=False in HF Model for correct sampling behavior (#506 by @Leymore).
  • Invalid Link Fix in documentation for better navigation (#499 by @yingfhu).

📚 Documentation & Maintenance

  • Subjective Comparison Introduction for organized documentation (#510 by @frankweijue).
  • README Update for better project understanding (#496 by @saakshii12).
  • Owner Update for correct ownership information (#504 by @Leymore).

🎊 New Contributors

We're delighted to welcome new contributors to the OpenCompass community!

  • @saakshii12 made their first contribution in #496.
  • @frankweijue stepped in with their first contribution in #510.

Changelog

The full list of changes is available in the changelog. A massive thank you to all the community members who contributed to this release. Your efforts are propelling OpenCompass further! 🙌

Embark on new explorations with OpenCompass v0.1.7!

0.1.6

7 months ago

Welcome to the newest version of OpenCompass! v0.1.6 brings forth exciting dataset additions, crucial fixes, and enhanced documentation. We're confident that this release will provide a better and smoother experience for all users.

🆕 Highlights:

  • Dataset Enrichment: Multiple additions, especially from the GLUE suite, to provide more versatility and better testing capabilities.
  • Documentation Revamp: Fixed dead links and updated the 'get_started' section to assist our users in navigating OpenCompass effortlessly.
  • Introducing New Faces: A warm welcome to our newest contributors. Your dedication and contributions are pivotal to our progress!

Dive into the details:

🌟 New Features:

  • 📦 Datasets Galore:

    • Introduced WikiText-2&103 dataset (#397)
    • GLUE dataset additions:
    • Lawbench dataset addition (#460)
  • 🛠 Utilities and Enhancements:

    • Re-implementation of ceval load dataset (#446)
    • Integrated turbomind inference through its RPC API (#414)
    • Moved fix_id_list to Retriever for better code organization (#442)
  • 📖 Documentation and Syncs:

    • Updated dataset list and get_started section (#437, #435)
    • Resolved dead links in the readme (#455)
    • Enhancements to LongEval and subjective evaluation (#443, #475)

🐛 Bug Fixes:

  • Addressed issues related to clp errors and support for bs>1 (#439)
  • Resolved issues concerning jieba rouge (#459, #467)
  • Enhanced EOS string detection for splitting (#477)
  • Various other fixes for optimal performance.

🎉 Welcome New Contributors:

  • A big shout-out to our new contributors:

Huge thanks to all contributors! Your constant efforts make OpenCompass better with each release. 🙌 🎉

Changelog

For a detailed overview, check out our Full Changelog.


If you find OpenCompass beneficial, kindly star 🌟 our GitHub repository! We value your feedback, reviews, and continued support.

0.1.5

7 months ago

Dive into our newly improved features, bug fixes, and most notably our enhanced dataset support, coming together to refine your experience.

🆕 Highlights:

  • Boosted Dataset Integrations: This release adds support for numerous datasets such as ds1000, promptbench, anthropics evals, kaoshi, and many more, making OpenCompass more versatile than ever.
  • More Evaluation Types: We have started integrating subjective and agent-aided LLM evaluation into OpenCompass. Stay tuned!

Explore the detailed changes:

🌟 New Features:

  • 📦 New Datasets and Features:
    • ds1000 dataset support (#395)
    • promptbench dataset implementation (#239)
    • anthropics evals dataset support (#422)
    • kaoshi dataset introduction (#392)
    • Initial support for subjective evaluation (#421)
    • Support for GSM8k evaluation tools (#277)
    • scibench evaluation added (#393)

📖 Documentation:

  • News updates and introduction figure in README (#375, #413)
  • Updated get_started.md and fixed naming issues (#377, #380)
  • New FAQ section added (#384)
  • README addition in longeval (#389)
  • Multimodal documentation introduced (#334)

🛠️ Bug Fixes:

  • Addressed a potential OOM issue (#387)
  • Added has_image fix to scienceqa (#391)
  • Resolved performance issues of visualglm (#424)
  • Debug logger fix for summarizer (#417)
  • Addressed errors in keep keys (#431)

⚙ Enhancements and Refactors:

  • Refinement in docs and code for better user guidance (#409)
  • Custom summarizer argument added in CLI mode (#411)
  • mPLUG-Owl and LLaMA-Adapter introduced (#405)
  • Enhanced mm models support on public datasets (#412)
  • Customized config path support (#423)

🎉 New Contributors:

A heartfelt welcome to our first-time contributors:

@wangxidong06 (First PR) @so2liu (First PR) @HoBeedzc (First PR) @CuteyThyme (First PR) @chenbohua3 (First PR)

To all contributors, old and new, thank you for continually enhancing OpenCompass! Your efforts are deeply valued. 🙌 🎉

If you love OpenCompass, don't forget to star 🌟 our GitHub repository! Your feedback, reviews, and contributions immensely help in shaping the product.

Changelog

Full Changelog: https://github.com/open-compass/opencompass/compare/0.1.4...0.1.5

0.1.4

8 months ago

OpenCompass v0.1.4 is here with an array of features, documentation improvements, and key fixes! Dive in to see what's in store:

🆕 Highlights:

  • More Tools and Features: OpenCompass continues to expand its repertoire with the addition of tools like update suffix, codellama, preds collection tools, qwen & qwen-chat support, and more. Not forgetting our attention to Otter and the MMBench Evaluation!
  • Documentation Facelift: We've made several updates to our documentation, ensuring it stays relevant, user-friendly, and aesthetically pleasing.
  • Essential Bug Fixes: We've tackled numerous bugs, especially those concerning tokens, triviaqa, nq postprocess, and qwen config.
  • Enhancements: From simplifying execution logic to suppressing warnings, we're always on the lookout for ways to improve our product.

Dive deeper to learn more:

🌟 New Features:

📦 Tools and Integrations:

  • Application of update suffix tool (#280).
  • Support for codellama and preds collection tools (#335).
  • Addition of qwen & qwen-chat support (#286).
  • Introduction of Otter to OpenCompass MMBench Evaluation (#232).
  • Support for LLaVA and mPLUG-Owl (#331).

🛠 Utilities and Functionality:

  • Enhanced sample count in prompt_viewer (#273).
  • Ignored ZeroRetriever error when id_list provided (#340).
  • Improved default task size (#360).

📝 Documentation:

  • Updated communication channels: WeChat and Discord (#328).
  • Documentation theme revamped for a fresh look (#332).
  • Detailed documentation for the new entry script (#246).
  • MMBench documentation updated (#336).

🛠️ Bug Fixes:

  • Resolved issue when missing both pad and eos token (#287).
  • Addressed triviaqa & nq postprocess glitches (#350).
  • Fixed qwen configuration inaccuracies (#358).
  • Default value added for zero retriever (#361).

⚙ Enhancements and Refactors:

  • Streamlined execution logic in run.py and ensured temp files cleanup (#337).
  • Suppressed unnecessary warnings raised by get_logger (#353).
  • Import checks of multimodal added (#352).

🎉 New Contributors:

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@Luodian (First PR) @ZhangYuanhan-AI (First PR) @HAOCHENYE (First PR)

Thank you to the entire community for pushing OpenCompass forward. Make sure to star 🌟 our GitHub repository if OpenCompass aids your endeavors! We treasure your feedback and contributions.


Changelog

For an exhaustive list of changes, kindly check our Full Changelog.

0.1.3

8 months ago

OpenCompass keeps getting better! v0.1.3 brings a variety of enhancements, new features, and crucial fixes. Here's a summary of what we've packed into this release:

🆕 Highlights:

  • Extended Dataset Support: OpenCompass now integrates a broader range of public datasets, including but not limited to adv_glue, codegeex2, Humanevalx, SEED-Bench, LongBench, and LEval. We aim to provide extensive coverage to cater to a variety of research needs.
  • Utility Additions: From the inclusion of multi-modal evaluations on MME benchmark to the Tree-of-Thought method, this release comes packed with functionality enhancements.
  • Bug Extermination: Your feedback helps us grow. We've squashed a series of bugs to improve your experience.
  • More Evaluation Benchmarks for Multimodal Models: We support another 10 evaluation benchmarks for multimodal models, including COCO Caption and ScienceQA, and provide corresponding evaluation code.

Let's delve deeper into what's new:

🌟 New Features:

📦 Extended Dataset Support:

  • Introduction of other public datasets (#206, #214).
  • Support for adv_glue dataset focused on adversarial robustness (#205).
  • Added codegeex2, Humanevalx (#210).
  • Integration of SEED-Bench (#203).
  • LongBench support (#236).
  • Reconstruct LEval dataset (#266).
  • Support another 10 public evaluation benchmarks for multimodal models (#214)

🛠 Utilities and Functionality:

  • Launch script added for ease of operations (#222).
  • Multi-modal evaluation on MME benchmark (#197).
  • Support for visualglm and llava on MMBench evaluation (#211).
  • Tree-of-Thought method introduced (#173).
  • Introduction of llama2 native implementations (#235).
  • Flamingo and Claude support added (#258, #253).

📝 Documentation:

  • Navigation bar language type updated for better clarity (#212).
  • News updates for keeping users informed (#241, #243).
  • Summarizer documentation added (#231).

🛠️ Bug Fixes:

  • Addressed an issue with multiple rounds of inference using mm_eval (#201).
  • Miscellaneous fixes such as name adjustments, requirements, and bin_trim corrections (#223, #229, #237).
  • Local runner debug issue fixed (#238).
  • Resolved bugs for PeftModel generate (#252).

⚙ Enhancements and Refactors:

  • Refactored instructblip for better performance and readability (#227).
  • Improved crowspairs postprocess (#251).
  • Optimization to use sympy only when necessary (#255).

🎉 New Contributors:

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@yyk-wew (First PR) @fangyixiao18 (First PR) @philipwangOvO (First PR) @cdpath (First PR)

Thank you to our dedicated contributors for making OpenCompass even more comprehensive and user-friendly! 🙌 🎉

Remember to star 🌟 our GitHub repository if you find OpenCompass helpful! Your feedback and contributions are invaluable.


Changelog

For a complete list of changes, please refer to our Full Changelog.

0.1.2

9 months ago

This release continues the evolution of OpenCompass, bringing a mix of new features, optimizations, documentation improvements, and bug fixes.

🆕 Highlights

🏆 Leaderboard: The evaluation results of Qwen-7B, XVERSE-13B, LLaMA-2, and GPT-4 have been posted to our leaderboard. It is now also possible to compare models online. We hope this feature offers deeper insights!

📊 Datasets: Introduction of Xiezhi, SQuAD2.0, ANLI, LEval datasets, and more for diverse applications. (#101, #192) Safety-related datasets have been added to the collections. [#185]

🎭 New modality: Support for MMBench is introduced, and the evaluation of multi-modal models is on the way! (#56, #161) The Intern language model is also introduced. (#51)

⚙️ Enhancement: Several enhancements to OpenAI models, including key deprecation, temperature setting, etc. [#121] [#128] Support for multiple tasks on one GPU, filtering messages by level, and more. [#148] [#187]

📝 Documentation: Comprehensive updates and fixes across READMEs, issue templates, prompt docs, metric documentation, and more.

🛠️ Bug Fixes: Including seed fixes in HFEvaluator, addressing issues in AGIEval multiple choice questions, and more. [#122] [#137]

🎉 New Contributors

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@go-with-me000 (First Contribution) @anakin-skywalker-Joseph (First Contribution) @zhouzaida (First Contribution) @dependabot (First Contribution)

Changelog

Full Changelog: https://github.com/InternLM/opencompass/compare/0.1.1...0.1.2

0.1.1

9 months ago

Add some more datasets.

  • AGIEval
  • anli
  • cmmlu
  • jigsawmultilingual
  • realtoxicprompts
  • SQuAD2.0
  • TheoremQA
  • triviaqa
  • xiezhi
  • Xsum

0.1.0

10 months ago

First release with some datasets.

  • ARC
  • BBH
  • ceval
  • CLUE
  • FewCLUE
  • GAOKAO-BENCH
  • LCSTS
  • math
  • mbpp
  • mmlu
  • nq
  • summedits
  • SuperGLUE