A comprehensive, versioned dataset of repositories and related metadata for public projects hosted on GitHub that relate to the 2019 Novel Coronavirus and the associated COVID-19 disease.
We have received a number of enquiries from researchers and the community about open collaboration on projects on the platform related to COVID-19, the disease caused by the SARS-CoV-2 virus. Many projects, ordered by star count, can be found using the covid-19 topic on GitHub. However, discovering other important projects is difficult because users self-identify their work in different ways. There are some great awesome lists, such as https://github.com/soroushchehresa/awesome-coronavirus, that document useful projects, but they are not versioned over time.
As this is such an important topic to many people at this time, we've decided to produce regular, versioned extracts of data from our systems and make them available to researchers under an open license, allowing teams outside of GitHub to analyze these public projects in more depth.
If you have created any interesting research based on this data, we would love to hear about it so that we can help ensure it becomes more prominently featured. Please open a PR against the file USER_SUBMISSIONS.md with a link to your research. We are especially interested in highlighting the most promising and impactful projects in need of community help and support.
Open source is bigger than any company or community. The dataset is released under CC0-1.0 for anyone to use and learn from.
There are two main sets of files, released via
json formats for public consumption in the directory
A comprehensive data dictionary that explains the contents of these files is here. The files are sorted in descending order by the count of distinct contributors at the time of extract.
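As a minimal sketch of working with that ordering, the Python below takes records already parsed from one of the JSON extracts, checks that they are in descending order of contributor count, and picks the top entries. The key names `repo` and `distinct_contributors` and the sample records are hypothetical, chosen only for illustration; the data dictionary is the place to find the real field names.

```python
from typing import Any

# Hypothetical field name; check the data dictionary for the real one.
CONTRIBUTOR_KEY = "distinct_contributors"


def is_sorted_descending(records: list[dict[str, Any]], key: str = CONTRIBUTOR_KEY) -> bool:
    """Return True if the records are in descending order of `key`."""
    counts = [r[key] for r in records]
    return all(a >= b for a, b in zip(counts, counts[1:]))


def top_n(records: list[dict[str, Any]], n: int, key: str = CONTRIBUTOR_KEY) -> list[dict[str, Any]]:
    """Return the n records with the highest `key`, without assuming any sort order."""
    return sorted(records, key=lambda r: r[key], reverse=True)[:n]


# Made-up records standing in for a parsed extract.
sample = [
    {"repo": "org/a", "distinct_contributors": 120},
    {"repo": "org/b", "distinct_contributors": 45},
    {"repo": "org/c", "distinct_contributors": 45},
    {"repo": "org/d", "distinct_contributors": 3},
]

print(is_sorted_descending(sample))
print([r["repo"] for r in top_n(sample, 2)])
```

Re-sorting inside `top_n` rather than slicing the list directly keeps the helper correct even if the records have been filtered or merged beforehand.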
The files have been versioned based on a weekly snapshot of identified repositories from the week of
We will update this repository with new data files on a monthly basis, generally on the first Tuesday of the month. We will revisit this each month and provide an update on whether we are continuing this commitment.
Rather than relying on any one GitHub topic to identify potential COVID-19 related projects, the data set is produced using a more comprehensive set of search criteria to identify projects likely to be COVID-19 related.
Note: This approach may include a small number of false positives. However, we decided it was better to cast a wide net and let consumers of the data perform additional cleaning if they wish.
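One possible form of such additional cleaning is sketched below: keep only records whose text fields match a stricter keyword pattern. The pattern, the field names `name` and `description`, and the sample records are all assumptions made for illustration; they are not the search criteria used to build the dataset.

```python
import re

# Illustrative stricter pattern; NOT the criteria used to produce the dataset.
STRICT_PATTERN = re.compile(
    r"covid|coronavirus|sars[-_ ]?cov[-_ ]?2|2019[-_ ]?ncov",
    re.IGNORECASE,
)


def looks_related(record, fields=("name", "description")):
    """Return True if any of the given text fields matches the stricter pattern."""
    return any(STRICT_PATTERN.search(record.get(f) or "") for f in fields)


def drop_likely_false_positives(records):
    """Keep only the records that pass the stricter check."""
    return [r for r in records if looks_related(r)]


# Made-up records; field names are hypothetical.
sample = [
    {"name": "covid19-dashboard", "description": "Charts of case counts"},
    {"name": "corona-sdk-demo", "description": "Game built with the Corona SDK"},
    {"name": "tracker", "description": "SARS-CoV-2 genome tracker"},
    {"name": "notes", "description": None},
]

cleaned = drop_likely_false_positives(sample)
print([r["name"] for r in cleaned])
```

A stricter filter like this trades recall for precision, so it is worth reviewing what it discards before relying on the cleaned set.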
Furthermore, since this data is versioned based on the week the repo was initially created, it may include data for repos that were originally public but have since been made private and are currently inaccessible.
The following parts of public metadata are currently used to identify public projects (licensed or not) as COVID-19 related:
Search terms against these metadata include variations of:
The data and associated documentation in this repo are open data released under the very permissive CC0-1.0 public domain dedication. However, please understand:
license_name field in the extract, and visit individual project repositories for details).
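As a rough sketch of how the license_name field might be used, the Python below groups records by license and collects those with no license recorded. Only the `license_name` key comes from the extract; the `repo` key, the license values, and the sample records are hypothetical.

```python
from collections import defaultdict


def group_by_license(records):
    """Map each license_name value to its repos; missing or empty values go under None."""
    groups = defaultdict(list)
    for r in records:
        groups[r.get("license_name") or None].append(r["repo"])
    return dict(groups)


# Made-up records; "repo" is a hypothetical key and the license values are examples.
sample = [
    {"repo": "org/a", "license_name": "MIT License"},
    {"repo": "org/b", "license_name": None},
    {"repo": "org/c", "license_name": "MIT License"},
    {"repo": "org/d"},
]

groups = group_by_license(sample)
unlicensed = groups.get(None, [])
print(groups)
print(unlicensed)
```

Repos that end up under `None` are the ones most worth checking by hand in their own repositories before reusing anything from them.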