Java-based tools for extracting information from GitHub and git repositories into a graph database
Copyright (c) 2011-2012 by IBM and the University of Nebraska-Lincoln
By Patrick Wagstrom <[email protected]> and Corey Jergenson <[email protected]>
This project is the product of a joint research project between IBM Research and the University of Nebraska-Lincoln. Under the terms of that agreement all output from this project must be distributed under the terms of the Apache License.
This project uses Apache Maven to manage all dependencies and versioning. The simplest way to get going is to run the following command:
mvn clean compile package assembly:single
GitMiner has many different properties that can be set to alter the behavior
of the program. The following is a simple configuration file that will be
enough to get you going. Copy this data to a file named something like
configuration.properties in the root directory of your GitMiner install.
net.wagstrom.research.github.login=YOURGITHUBLOGIN
net.wagstrom.research.github.password=YOURGITHUBPASSWORD
net.wagstrom.research.github.email=YOUREMAILADDRESS
net.wagstrom.research.github.dbengine=neo4j
net.wagstrom.research.github.dburl=graph.db
net.wagstrom.research.github.projects=pridkett/gitminer
Alternatively, you may authenticate by using an OAuth token to be configured as follows in lieu of giving login and password.
net.wagstrom.research.github.token=YOUROAUTHTOKEN
See http://developer.github.com/v3/oauth/ for more information.
If you plan on using the Git Repository loading functionality then you'll need to set the following options that are specific to that functionality. In the future these options will be merged together, but for right now you'll just need to repeat them.
edu.unl.cse.git.dbengine=neo4j
edu.unl.cse.git.dburl=graph.db
edu.unl.cse.git.repositories=pridkett/gitminer
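Before starting a long crawl it can help to confirm that the file parses and carries the credentials described above. The following is a minimal sketch and is not part of GitMiner: the class name ConfigCheck and its methods are hypothetical, and the rule it encodes (an email address plus either a token or a login/password pair) is an assumption drawn from this README, not from GitMiner's source.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight check; not shipped with GitMiner.
public class ConfigCheck {
    static final String PREFIX = "net.wagstrom.research.github.";

    // True when the property is present and not blank.
    static boolean has(Properties p, String key) {
        String value = p.getProperty(PREFIX + key);
        return value != null && !value.trim().isEmpty();
    }

    // Returns short descriptions of settings that appear to be missing.
    // Assumption: either a token or a login/password pair is sufficient,
    // and the email address is treated as required.
    static List<String> missingKeys(Properties p) {
        List<String> missing = new ArrayList<>();
        boolean hasToken = has(p, "token");
        boolean hasLoginPair = has(p, "login") && has(p, "password");
        if (!hasToken && !hasLoginPair) {
            missing.add("token or login/password");
        }
        if (!has(p, "email")) {
            missing.add("email");
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "configuration.properties";
        Properties p = new Properties();
        try (Reader reader = new FileReader(path)) {
            p.load(reader);
        }
        List<String> missing = missingKeys(p);
        System.out.println(missing.isEmpty()
                ? "configuration looks complete"
                : "missing: " + missing);
    }
}
```

Compiled on its own, this can be run with the path to the properties file as its only argument; with no argument it looks for configuration.properties in the current directory.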
Execution of GitMiner is a two-step process that consists of first using the GitHub API to download project data and then later using git directly to process project source code commits. The configuration file created in the previous step has all the settings you'll need for both stages.
To begin, run GitMiner so it downloads data using the GitHub API:
./gitminer.sh -c configuration.properties
Next, use the repository loader functions of GitMiner to download the source code history for the projects.
./repo-loader.sh -c configuration.properties
For the most part we have attempted to provide sensible defaults for configuration parameters; however, some parameters must have their values set for the tool to function.
name: net.wagstrom.research.github.login
default: no default
description: this parameter must be set. On October 14, 2012 GitHub
changed the way their API works and reduced the number of anonymous API
requests to 60/hour. With this parameter set you can get as many as 5000
requests an hour. Without this parameter GitMiner will just refuse to run.
name: net.wagstrom.research.github.password
default: no default
description: the companion to net.wagstrom.research.github.login.
name: net.wagstrom.research.github.token
default: no default
description: this can be set instead of login and password to
authenticate with GitHub using the given OAuth token as documented on
http://developer.github.com/v3/oauth/.
name: net.wagstrom.research.github.email
default: no default
description: this is your email address. GitHub has requested that all
clients using the API provide additional mechanisms to identify themselves
via the user-agent. One of the ways that GitMiner accomplishes this is by
putting your email address in the user-agent string. Please be nice and set
this value accordingly.
name: net.wagstrom.research.github.projects
default: no default
description: a comma separated list of projects to begin spidering. For
example, rails/rails,pridkett/gitminer,tinkerpop/blueprints.
name: net.wagstrom.research.github.users
default: no default
description: a comma separated list of users to spider. For example,
pridkett,jurgns,dhh.
name: net.wagstrom.research.github.organizations
default: no default
description: a comma separated list of organizations to spider. For
example, 37signals,tinkerpop.
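The three list properties above share the same comma separated shape. As an illustration only, the sketch below splits such a value and checks that project entries look like owner/name. The class ListProperty is hypothetical, and trimming whitespace or dropping empty entries is an assumption; GitMiner's own parsing may be stricter or looser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not GitMiner's own parser.
public class ListProperty {
    // Splits a comma separated value, trimming whitespace and
    // dropping empty entries (both are assumptions of this sketch).
    static List<String> parse(String value) {
        List<String> items = new ArrayList<>();
        if (value == null) {
            return items;
        }
        for (String part : value.split(",")) {
            String item = part.trim();
            if (!item.isEmpty()) {
                items.add(item);
            }
        }
        return items;
    }

    // A project entry is expected to contain exactly one slash with
    // text on both sides, for example rails/rails.
    static boolean isProjectName(String item) {
        int slash = item.indexOf('/');
        return slash > 0
                && slash == item.lastIndexOf('/')
                && slash < item.length() - 1;
    }

    public static void main(String[] args) {
        String projects = "rails/rails,pridkett/gitminer,tinkerpop/blueprints";
        for (String project : parse(projects)) {
            System.out.println(project + " valid=" + isProjectName(project));
        }
    }
}
```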
name: net.wagstrom.research.github.refreshTime
default: 0.0
description: minimum number of days since the last update to download
information about a user or other element again. For most purposes you can
probably set this much higher. This will GREATLY speed up your crawls if you
set it to a high value.
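To make the effect of refreshTime concrete, here is a rough sketch of the kind of check the description implies. The class RefreshPolicy and the method needsRefresh are hypothetical names, and treating a never-downloaded element as always due is an assumption; GitMiner's actual logic is not shown in this README.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of a refreshTime check; not GitMiner's code.
public class RefreshPolicy {
    private static final double SECONDS_PER_DAY = 86400.0;

    // True when at least refreshDays days have passed since lastUpdate,
    // or when the element has never been downloaded (lastUpdate == null).
    static boolean needsRefresh(Instant lastUpdate, Instant now, double refreshDays) {
        if (lastUpdate == null) {
            return true;
        }
        double elapsedDays =
                Duration.between(lastUpdate, now).getSeconds() / SECONDS_PER_DAY;
        return elapsedDays >= refreshDays;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant twoDaysAgo = now.minus(Duration.ofDays(2));
        // With the default of 0.0 every element is fetched again.
        System.out.println(needsRefresh(twoDaysAgo, now, 0.0));
        // With a value of 7.0 an element updated two days ago is skipped.
        System.out.println(needsRefresh(twoDaysAgo, now, 7.0));
    }
}
```

Under this reading, the default of 0.0 means every element is downloaded again on each run, which is why a higher value cuts the number of API calls so sharply.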
name: net.wagstrom.research.github.apiThrottle.maxCalls.v3
default: 4980
description: the maximum number of calls via the GitHub v3 API in a given
time period. Use this to keep your request rate below the limit the API
reports. I typically set this value to 4980 or something like that to avoid
problems when I hit API limits. If the value is 0 then this is ignored.
name: net.wagstrom.research.github.apiThrottle.maxCallsInterval.v3
default: 3600
description: time period (in seconds) over which the maximum number of
calls may be made using the v3 GitHub API. Previously some APIs allowed
60 calls/min and others 5000/hr, but the API didn't set this. Now it seems to
always be 5000/hr, so this is generally set to 3600.
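The two throttle settings together describe a budget of calls per time window. The sketch below shows one simple way such a budget could be enforced, using a fixed window. It is an illustration under stated assumptions, not GitMiner's implementation: the class ApiThrottle, the method tryAcquire, and the injectable clock are all hypothetical.

```java
import java.util.function.LongSupplier;

// Hypothetical fixed-window limiter; GitMiner's throttling may differ.
public class ApiThrottle {
    private final int maxCalls;         // maxCalls.v3; 0 disables the limit
    private final long intervalMillis;  // maxCallsInterval.v3 in milliseconds
    private final LongSupplier clock;   // current time in milliseconds
    private long windowStart;
    private int callsInWindow;

    ApiThrottle(int maxCalls, long intervalSeconds, LongSupplier clock) {
        this.maxCalls = maxCalls;
        this.intervalMillis = intervalSeconds * 1000L;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Records one call and returns 0 if the call is allowed. Otherwise
    // returns the milliseconds left until the current window ends and
    // records nothing.
    synchronized long tryAcquire() {
        if (maxCalls == 0) {
            return 0L;
        }
        long now = clock.getAsLong();
        if (now - windowStart >= intervalMillis) {
            // The previous window has expired, so start a fresh one.
            windowStart = now;
            callsInWindow = 0;
        }
        if (callsInWindow < maxCalls) {
            callsInWindow++;
            return 0L;
        }
        return windowStart + intervalMillis - now;
    }

    public static void main(String[] args) throws InterruptedException {
        ApiThrottle throttle = new ApiThrottle(4980, 3600, System::currentTimeMillis);
        long wait = throttle.tryAcquire();
        while (wait > 0) {
            // Sleep until the window resets, then ask again.
            Thread.sleep(wait);
            wait = throttle.tryAcquire();
        }
        System.out.println("call permitted");
    }
}
```

Passing the clock in as a parameter is only a convenience for testing; a real client would more likely rely on the rate limit information GitHub returns with each response.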
name: net.wagstrom.research.github.miner.repositories
default: true
description: a true/false parameter on whether or not to download
data for the projects specified in the net.wagstrom.research.github.projects
property.
name: net.wagstrom.research.github.miner.repositories.collaborators
default: true
description: a true/false parameter on whether or not to download
data for the collaborators listed for each project.
name: net.wagstrom.research.github.miner.repositories.contributors
default: true
description: a true/false parameter on whether or not to download
data for the contributors listed for each project.
name: net.wagstrom.research.github.miner.repositories.watchers
default: true
description: a true/false parameter for whether or not to download
data for the watchers listed for each project.
name: net.wagstrom.research.github.miner.repositories.forks
default: true
description: a true/false parameter for whether or not to download
data about forks for each project.
name: net.wagstrom.research.github.miner.repositories.issues
default: true
description: a true/false parameter for whether or not to download
data about issues for each project.
name: net.wagstrom.research.github.miner.repositories.pullrequests
default: true
description: a true/false parameter for whether or not to download
data about pull requests for each project.
name: net.wagstrom.research.github.miner.repositories.users
default: true
description: a true/false parameter for whether or not to download
all the data about all the users for each project. FIXME: I'm not certain
off the top of my head about what interplay this has with other settings above.
name: net.wagstrom.research.github.miner.users.events
default: true
description: a true/false parameter for whether or not to download the
public events stream for each user mined.
name: net.wagstrom.research.github.miner.users.gists
default: true
description: a true/false parameter for whether or not to download the
set of gists for each user mined.
name: net.wagstrom.research.github.miner.users
default: true
description: a true/false parameter for whether or not to download
any information for the users listed in net.wagstrom.research.github.users.
name: net.wagstrom.research.github.miner.organizations
default: true
description: a true/false parameter for whether or not to download
any information for the organizations listed in
net.wagstrom.research.github.organizations.
name: net.wagstrom.research.github.miner.gists
default: true
description: a true/false parameter for whether or not to download
any gists at all.
name: net.wagstrom.research.github.dbengine
default: neo4j
description: the name of the Blueprints database backend to use. Right
now this has only been tested on neo4j, orientdb, and tinkergraph. This
feature is dependent on the features present in govscigraph.
name: net.wagstrom.research.github.dburl
default: github.db
description: the URL of the database to save to. For neo4j this is
simply the directory where the database exists.
Some repositories require a substantial amount of Java heap memory.
In these cases, setting the Java heap size as follows seems to work.
*this fixes ISSUE 34
export JAVA_OPTIONS="-Xms12g -Xmx12g"
If you'd like to see the output of GitMiner without having to execute it, we have made two full datasets available on GitHub.