Github Data Mining: 101

Francesco, publicly an Android developer at Novoda, is an undercover multi-technologist who likes to hack with all kinds of systems and watches too many TV shows. Clear eyes, full hearts, can't lose.

This blog post outlines how Novoda mines organisation data from our Github projects and describes the data mining architecture we came up with, helping you understand the Github API and improve the visibility of your own organisation data.

Here at Novoda, we are proud to say that all decisions we take are data-driven. Why choose one technology over another? Why prefer a generic architecture over a specific one? Why develop a custom made solution over a pre-existing library?

In combination with this line of questioning, we’ve been asking ourselves how we can extract and analyse relevant information from Github around team dynamics, developer time spent on pull requests, cross-project collaboration and participation in open source projects.

Github Reports Objectives

We can break down the above description into a list of goals for a report system that is able to give us metrics regarding:

- team dynamics and how developers spend their time on pull requests
- cross-project collaboration within the organisation
- participation in open source projects

Understanding this data may reveal fundamental clues that drive actions to improve team organisation, identify if and where bottlenecks are, and proactively solve them.

Credits: Picture by Othree on Flickr

Exploring Github’s Available Data

Github already has a statistics page for each repo, but it is a bit limited and cannot answer all of our requirements. So we began exploring all the possibilities for retrieving these statistics using the Github APIs.

At first we thought of requesting data on demand. For instance, getting statistics about a user in a certain period of the year would be solved by querying that user’s data from the API. But then we asked ourselves how we could compare multiple users’ data, which led to the realisation that we should actually query data for all of the organisation’s (Novoda’s) developers at once.
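
To give a concrete idea of what “on demand” means, here is a minimal sketch of such a query using Java’s built-in HTTP client; the repository name and the token placeholder are purely illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OnDemandQuery {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // List all issues of one repository updated since a given date.
        // "novoda/some-repo" and the token are placeholders.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.github.com/repos/novoda/some-repo/issues"
                        + "?state=all&since=2016-08-01T00:00:00Z"))
                .header("Accept", "application/vnd.github.v3+json")
                .header("Authorization", "token <personal-access-token>")
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON array of issues and pull requests
    }
}
```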


For each report generation, then, we would need to make an unbelievable number of API calls, since Novoda has ~130 repositories and we can reasonably assume that, on average:

- every repository needs one request per day just to list its issues
- every repository has around 5 issues or pull requests active per day, each needing a request of its own

This means that mining 30 days’ worth of data requires executing:

\begin{align}
(130~\text{repos}) \times (1~\text{issue list} + 5~\text{issues}) \times (30~\text{days}) &= 23{,}400~\text{requests} \\
\frac{23{,}400~\text{requests}}{5{,}000~\text{requests per hour}} &\simeq 5~\text{hours}
\end{align}

Should any of these requests fail, or should we want to change the date range even by a day, we would have to run the whole querying process all over again!

Data Mining Github

As good Computer Science scholars, we know that the next step would be to get all of Novoda’s Github data into our own data mart, then mine it to extract all the pieces of information we need. The database would only contain the minimum subset of information needed for the data we want to extract, making it as small and efficient as possible.

Github provides us with two ways to access its information:

- the REST API, which we can query on demand for historical data
- WebHooks, which notify an endpoint of new events as they happen

Foreseeable issues

For requests using Basic Authentication or OAuth, you can make up to 5,000 requests per hour.

Currently set at 5,000 requests per hour, the generous Github rate limit allows for excellent integrations in low and medium-load clients, but it is definitely too low for us, since we want to mine all of our historical data.
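
Every API response tells us where we stand against that limit, via the X-RateLimit-Remaining and X-RateLimit-Reset headers. A minimal sketch of reading them (the helper name is ours, for illustration):

```java
import java.net.http.HttpResponse;

public final class RateLimitCheck {

    // Github reports the remaining quota and the reset time (Unix epoch seconds)
    // in the X-RateLimit-Remaining and X-RateLimit-Reset headers of every response.
    static long secondsUntilQuotaAvailable(HttpResponse<?> response) {
        long remaining = Long.parseLong(
                response.headers().firstValue("X-RateLimit-Remaining").orElse("1"));
        if (remaining > 0) {
            return 0; // quota still available, keep going
        }
        long resetEpochSeconds = Long.parseLong(
                response.headers().firstValue("X-RateLimit-Reset").orElse("0"));
        return Math.max(0, resetEpochSeconds - System.currentTimeMillis() / 1000);
    }
}
```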

This issue is connected to another, more practical one: where can we run the whole system? Downloading the history for the entire organisation would take at least a few days (we’ve been active on Github since 2011), while the live system (using the WebHooks) would need to be always on and performant enough to save all incoming data.

Solution Architecture

After a lot of coffee and many Super Mario Kart 8 races, we came up with a multi-tiered solution that we believe is optimal for our use case.

Github Data Analysis

Studying the Github API and data, we figured out that the data we needed to retrieve is tree-shaped:

Tree of Github nodes

By building and traversing this tree we know for sure that all the information will be received and persisted in an orderly fashion; otherwise our database constraints would be violated (e.g., repositories must be retrieved and stored before moving on to their issues).
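
One possible way to express that tree in code is a plain hierarchy of node types, each mapped to the children it can generate; the names below are made up for illustration and do not mirror any specific schema:

```java
import java.util.List;
import java.util.Map;

public final class GithubTree {

    // The kinds of node we need to walk, from the root down to the leaves.
    enum NodeType { ORGANISATION, REPOSITORY, ISSUE, COMMENT, EVENT }

    // Which node types each node can generate as children.
    static final Map<NodeType, List<NodeType>> CHILDREN = Map.of(
            NodeType.ORGANISATION, List.of(NodeType.REPOSITORY),
            NodeType.REPOSITORY, List.of(NodeType.ISSUE),
            NodeType.ISSUE, List.of(NodeType.COMMENT, NodeType.EVENT),
            NodeType.COMMENT, List.of(),
            NodeType.EVENT, List.of()
    );
}
```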

Brainstorming the Github data

System view

The “live” system would run in the cloud, exposing an endpoint that listens to, converts and persists the relevant information sent by Github.
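
As a rough sketch (not the actual implementation), such an endpoint can be as small as an HTTP handler that reads the X-GitHub-Event header and the JSON payload Github sends with every WebHook delivery:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

public final class WebhookEndpoint {

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        server.createContext("/github/webhook", exchange -> {
            // Github names the event type (e.g. "issues", "pull_request") in this header.
            String eventType = exchange.getRequestHeaders().getFirst("X-GitHub-Event");
            String payload = new String(exchange.getRequestBody().readAllBytes());

            // Here we would convert the payload and persist the relevant information.
            System.out.println("Received " + eventType + " event, " + payload.length() + " bytes");

            exchange.sendResponseHeaders(200, -1); // -1: no response body
            exchange.close();
        });

        server.start();
    }
}
```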

The “historical/batch” system was a bit harder to figure out, since it is composed of simple operations (query the API, save the data, move on to the next node) that need to be repeated over and over.

And what do you do when you have a complex, repeating sequence of steps? Easy, you make it recursive! Recursion helps you figure out the smaller parts of the problem and reduce them to a final state, that is: your solution.

Reducing from infinite to finite

We started calling our tree nodes “messages”: something that our system receives, knows how to handle and eventually reduces to a termination, either for a single message or for the whole tree.

When we have a local termination, it means that we have reached a leaf node in the tree. To analyse and process one message, our system will:

  1. Start
  2. Accept a message in the “current position” within the tree
    a. the base step is “Get organisation”
  3. Perform the operation (recursive step):
    a. execute the API request
    b. save the data into the database
    c. create 0 to n new messages in the tree
  4. Move the “current position” within the tree to the “next position”
    a. the final step is reached when there are no more nodes
  5. Exit

To simplify this, we linearised the tree into a queue using a breadth-first search, so that the concept of “next position” is more explicit. A depth-first search would have worked just as well, since both guarantee that a parent node is processed before its children.
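
A minimal sketch of that queue-driven loop, assuming a hypothetical handle() method that performs the API request, saves the data and returns the child messages it generated:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public final class MessageLoop {

    // A message is a node of the tree: a type plus the identifier it refers to.
    record Message(String type, String identifier) {}

    // Hypothetical handler: performs the API request, persists the data and
    // returns the 0..n new messages (children) generated by this node.
    static List<Message> handle(Message message) {
        // ... API call and database writes would go here ...
        return List.of();
    }

    public static void main(String[] args) {
        Queue<Message> queue = new ArrayDeque<>();
        queue.add(new Message("GET_ORGANISATION", "novoda")); // the base step

        // Keep consuming until there are no more nodes: the final state.
        while (!queue.isEmpty()) {
            Message current = queue.poll();
            queue.addAll(handle(current));
        }
    }
}
```

Because the queue is FIFO, newly created children always end up behind the nodes discovered before them, which is exactly the breadth-first order described above.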

Linearising the tree in a queue

Handle API rate limit

What if a request fails because we have reached the API limit for the hour? Our solution is to reschedule the same message on the queue, to be processed again as soon as the API limit is reset by Github.

Let’s not worry about the actual implementation of the rescheduling system: keeping it abstract helps break all of the issues down and leaves the implementation details for later.
There will be some kind of active mechanism that wakes up the queue and restarts the process; since our system always picks up the first message in the queue (FIFO), we know for a fact that it will re-run the operation that failed before.

The same behaviour can be used for network/database errors, rescheduling the process to re-run immediately or after a short pause.
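
To keep the idea concrete without committing to a specific implementation, here is one possible shape for that retry logic, using a scheduled executor and a made-up RateLimitException; a real system could swap in whatever queue and scheduler it runs on:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class Rescheduler {

    // Made-up exception carrying the number of seconds until Github resets the limit.
    static final class RateLimitException extends RuntimeException {
        final long secondsUntilReset;
        RateLimitException(long secondsUntilReset) { this.secondsUntilReset = secondsUntilReset; }
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Runs a unit of work now; if the rate limit (or a transient error) gets in the way,
    // schedules the exact same work to run again later instead of dropping it.
    void runOrReschedule(Runnable work) {
        try {
            work.run();
        } catch (RateLimitException rateLimited) {
            scheduler.schedule(() -> runOrReschedule(work), rateLimited.secondsUntilReset, TimeUnit.SECONDS);
        } catch (RuntimeException transientError) {
            scheduler.schedule(() -> runOrReschedule(work), 30, TimeUnit.SECONDS); // short pause, then retry
        }
    }
}
```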

Using paging with live data

Something we have not considered yet is paging. To avoid exhausting the API limit too quickly, we should retrieve information in batches as large as possible. But the data we ask for isn’t frozen in time: from the moment we ask for a page to the moment we ask for the following one, new data might be added!

This may introduce two problems: data duplication and incomplete data.

Duplicated data

As pages are ordered by descending date, if new data is added, old data might be retrieved again in following pages and saved twice onto the database. The naive solution would be to UPDATE OR INSERT (upsert) all data received from the API… which is just what we did!
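
As a sketch of what that looks like, assuming a PostgreSQL-style database with an issues table keyed by the Github issue id (table and columns are illustrative):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public final class IssueUpsert {

    // Insert the row, or update it if we already stored this issue in a previous run.
    static void upsertIssue(Connection connection, long id, String title, Timestamp updatedAt)
            throws SQLException {
        String sql = "INSERT INTO issues (id, title, updated_at) VALUES (?, ?, ?) "
                + "ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title, updated_at = EXCLUDED.updated_at";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setLong(1, id);
            statement.setString(2, title);
            statement.setTimestamp(3, updatedAt);
            statement.executeUpdate();
        }
    }
}
```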

Simpler is better

Incomplete data

Since new data shifts old data towards later pages, we might not be able to retrieve all the old information we need: the number of messages generated from the previous node corresponds to the number of pages the Github API reported at that particular moment in time. This means that we might be told there are 3 pages at the start, but new data then shifts the total to 5 pages, while our original count is still at 3!

Our solution consists in retrieving a page from the API, then checking whether Github says it is the last one: if it is not, it means that new data has been generated and we need to request the new pages, i.e. add new messages to the queue.
This leads to a change of spec: a tree node does not only generate new children, but also siblings! Having linearised our tree into a queue, though, this is a merely academic observation.
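
Github advertises the paging information in the Link response header, so checking whether a page is the last one boils down to looking for a rel="next" entry; a small sketch (the helper is illustrative):

```java
import java.net.http.HttpResponse;
import java.util.Arrays;
import java.util.Optional;

public final class PagingCheck {

    // Github lists the next/last pages in the Link header, e.g.:
    // Link: <https://api.github.com/repositories/123/issues?page=2>; rel="next",
    //       <https://api.github.com/repositories/123/issues?page=5>; rel="last"
    // If there is no rel="next" entry, we are on the last page.
    static Optional<String> nextPageUrl(HttpResponse<?> response) {
        return response.headers().firstValue("Link").stream()
                .flatMap(header -> Arrays.stream(header.split(",")))
                .filter(part -> part.contains("rel=\"next\""))
                .map(part -> part.substring(part.indexOf('<') + 1, part.indexOf('>')))
                .findFirst();
    }
}
```

If a next page exists, we simply enqueue a new message for that URL, which is the “sibling” case described above.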

Chang allows it.

Data results

All of the information we mined from Github ended up in a database full of repositories, pull requests, merges, closes, comments and much more.

All the data!

Our next blog posts will show you how we use all this data to find interesting correlations, patterns and outliers.

What’s next?

Running this system on a local machine means you have all the data only on that machine AND you can’t do much else whilst it is running! Our next blog posts will also detail the implementation of this architecture on Amazon Web Services, go through the challenges we faced and help you get set up if you want to achieve similar results.

About Novoda

We plan, design, and develop the world’s most desirable Android products. Our team’s expertise helps brands like Sony, Motorola, Tesco, Channel4, BBC, and News Corp build fully customised Android devices or simply make their mobile experiences the best on the market. Since 2008, our full in-house teams have been working from London, Liverpool, Berlin, Barcelona, and NYC.

Let’s get in contact