Open LLM Leaderboard Report

Weekly visualization report of Open LLM model performance based on 4 metrics.


Open LLM Leaderboard Report (Weekly Update)

Latest update: 20230619

This repository offers weekly visualizations that showcase the performance of open-source Large Language Models (LLMs), based on evaluation metrics sourced from Hugging Face's Open-LLM-Leaderboard. The visualizations are refreshed weekly to ensure up-to-date information.

Source data

You can refer to this CSV file for the underlying data used for visualization. The raw data is a JSON file formatted as a 2D list.
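
For quick exploration, here is a minimal sketch of reading that data; the file paths are hypothetical placeholders, and the first inner list of the JSON is assumed to be the header row (check the actual files in this repository):

import json
import pandas as pd

# Read the CSV used for the visualizations (hypothetical path).
df = pd.read_csv("data/leaderboard.csv")

# Or read the raw 2D-list JSON; the first inner list is assumed to hold the column names.
with open("data/leaderboard.json", "r", encoding="utf-8") as f:
    rows = json.load(f)
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())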

Revision with analysis

Revision Summary

Run

Adjust the settings in config.py (see the hypothetical sketch after the commands below), then run:

git clone https://github.com/dsdanielpark/Open-LLM-Leaderboard-Report
cd Open-LLM-Leaderboard-Report
python main.py
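
The option names below are only a hypothetical illustration of what config.py might hold; refer to the file in the repository for the actual settings.

# config.py (hypothetical option names, for illustration only)
CSV_PATH = "data/leaderboard.csv"   # source data exported from the Open LLM Leaderboard
OUTPUT_DIR = "assets"               # directory where the weekly charts are saved
TOP_N = 10                          # number of models shown in the Top-N charts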

Summary

Parameters: Parameter counts are scaled so that the largest model observed so far corresponds to 100, giving a percentage representation.

Average Ranking

What is Open-LLM-Leaderboard?

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

The Open LLM Leaderboard tracks, ranks, and evaluates large language models and chatbots. It evaluates models based on benchmarks from the Eleuther AI Language Model Evaluation Harness, covering science questions, commonsense inference, multitask accuracy, and truthfulness in generating answers.

The benchmarks aim to test reasoning and general knowledge in different fields using 0-shot and few-shot settings.

Evaluation is performed against 4 popular benchmarks:

  • AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
  • HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
  • MMLU (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
  • TruthfulQA (0-shot) - a benchmark to measure whether a language model is truthful in generating answers to questions.

These benchmarks were chosen because they test a variety of reasoning and general knowledge across a wide range of fields in 0-shot and few-shot settings.
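
The Average metric used below is assumed to be the simple mean of the four benchmark scores, as on the leaderboard; a small sketch (the column names are hypothetical):

# Average of the four benchmark scores for one model (benchmark names are hypothetical labels).
BENCHMARKS = ["ARC (25-shot)", "HellaSwag (10-shot)", "MMLU (5-shot)", "TruthfulQA (0-shot)"]

def average_score(row: dict) -> float:
    # row maps a benchmark name to its score in percent
    return sum(row[name] for name in BENCHMARKS) / len(BENCHMARKS)

print(average_score({"ARC (25-shot)": 61.5, "HellaSwag (10-shot)": 84.9,
                     "MMLU (5-shot)": 60.1, "TruthfulQA (0-shot)": 48.0}))  # 63.625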

Top 5

Top 10

Performance by Metric

Average

HellaSwag (10-shot)

MMLU (5-shot)

AI2 Reasoning Challenge (25-shot)

TruthfulQA (0-shot)

Parameters

Parameters: Parameter counts are scaled so that the largest model observed so far corresponds to 100, giving a percentage representation.
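
A minimal sketch of this normalization, assuming a plain list of parameter counts:

def normalize_parameters(param_counts):
    # Scale so that the largest model observed maps to 100.
    largest = max(param_counts)
    return [count / largest * 100 for count in param_counts]

# e.g. 7B, 13B and 65B parameter models -> roughly [10.77, 20.0, 100.0]
print(normalize_parameters([7e9, 13e9, 65e9]))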

Citation

@software{Open-LLM-Leaderboard-Report-2023,
  author = {Daniel Park},
  title = {{Open-LLM-Leaderboard-Report}},
  url = {https://github.com/dsdanielpark/Open-LLM-Leaderboard-Report},
  year = {2023}
}

Reference

[1] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
