Last week was dedicated to indexing the users registered on GitHub.
Getting the list of users was a necessary step to test how difficult it would be to extract information from the self-appointed largest (public) open source repository on Earth.
To my surprise, it was relatively straightforward and quick. Over the past months I've come to understand why GitHub is becoming the hosting place of choice. It has a neat user interface, provides incentives for developers to commit code and simply goes straight to what really matters.
To access the API, there is a set of Java libraries available (or you can just use the raw JSON format). The one I found most suited is jcabi-github. The reason is that every now and then we'd hit the API rate limit (5,000 requests/hour), and this library allows keeping the connection paused until the limit is lifted. I noted some defects while using the library; however, the developers behind it were pretty amazing and solved most of the reported issues at light speed.
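The pause-until-reset behaviour rests on two headers that GitHub's API v3 returns with every response: X-RateLimit-Remaining (requests left in the current window) and X-RateLimit-Reset (Unix epoch seconds when the quota refreshes). As a minimal sketch of the idea, here is a hypothetical helper (not part of the library) that computes how long a client should sleep:

```java
// Sketch of the pause-until-reset idea: when the hourly quota is spent,
// compute the sleep time from the X-RateLimit-Reset header (Unix epoch
// seconds). Class and method names are my own, for illustration only.
public final class RateLimitPause {

    /** Milliseconds to wait before the next request; never negative. */
    public static long millisUntilReset(int remaining, long resetEpochSeconds,
            long nowEpochSeconds) {
        if (remaining > 0) {
            return 0L; // quota left, no need to pause
        }
        long waitSeconds = resetEpochSeconds - nowEpochSeconds;
        return waitSeconds > 0 ? waitSeconds * 1000L : 0L;
    }
}
```

The caller would simply `Thread.sleep()` for that duration and retry, which is how a long-running indexer survives hitting the 5,000 requests/hour ceiling.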
With the library in place, I wrote a Java project to iterate through all the users registered on GitHub using authenticated API calls. At first I noted that the Internet connection broke very often, usually at around 200k indexed users. I was running these tests from the office network, which was not stable enough, so I moved the software to a dedicated server at a datacenter that is continuously online. From there the connection lasted longer, but it still failed at around 800k indexed users.
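Under the hood, the iteration maps to GitHub's documented GET /users?since=<id> endpoint, which returns pages of users as a JSON array; for the index, only each entry's "login" field matters. A minimal sketch of handling one such page (a real client should use a proper JSON parser; a regex is enough to show the idea, and the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One step of the paging loop: given the JSON body of a /users?since=<id>
// response, pull out the "login" of every user entry, in order.
public final class UserPage {

    private static final Pattern LOGIN =
        Pattern.compile("\"login\"\\s*:\\s*\"([^\"]+)\"");

    /** Extracts every login name from one JSON page. */
    public static List<String> logins(String jsonPage) {
        List<String> result = new ArrayList<>();
        Matcher m = LOGIN.matcher(jsonPage);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }
}
```

Each extracted login is appended to the output file, and the last user's id feeds the next `since` value, page after page until the API returns an empty array.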
Being a stubborn person by nature, I got thinking about how to resume the operation from the point where it had stopped. The API did provide a resume option starting from a given user name, but the libraries didn't yet cover this point. To my luck, the jcabi team was very prompt in explaining how the API call could be used. With support for resuming available, it was then a simple matter of reading back the text file (one user per line) and getting the last one.
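Since the index file holds one user name per line, the resume point is simply its last non-empty line. A minimal sketch (my own helper, not the project's code), written as a pure function so the lines can come from e.g. `Files.readAllLines(Paths.get("users.txt"))`:

```java
import java.util.List;

// The resume step: the last non-empty line of the index file is the
// user name from which the API iteration should continue.
public final class ResumePoint {

    /** Returns the last non-empty line, or null if the file is empty. */
    public static String lastUser(List<String> lines) {
        for (int i = lines.size() - 1; i >= 0; i--) {
            String line = lines.get(i).trim();
            if (!line.isEmpty()) {
                return line;
            }
        }
        return null;
    }
}
```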
I didn't mention it before, but the storage medium I'm using is plain text files (a.k.a. flat files). I'm not using a relational database, nor have I looked much into "NoSQL" databases. Of all the options I've tested over the years, nothing beats the processing simplicity of a plain text file for storing large arrays of data. Please note that I emphasize simplicity: if performance had been affected significantly, I'd use a more suitable method. However, it turns out that normal desktop computers in 2014 can handle text files with millions of lines in under a second.
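The reason flat files hold up is that every operation here is a single sequential pass. As a rough sketch you can use to check the claim on your own machine (the class is hypothetical; point the reader at the real file):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Counting the lines of a multi-million-line text file is one
// sequential pass; buffered reads make it finish in well under a
// second on ordinary 2014 desktop hardware.
public final class LineCount {

    /** Counts the lines available from the given source. */
    public static long count(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the index file itself, the source would be a `FileReader` over users.txt; the same pattern serves for counting, filtering or appending.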
With the resume feature implemented, it was possible to leave a single machine indexing the user names non-stop. After two days of processing, the indexing was complete and showed a total of 7,709,288 registered users.
At the time of this writing (July 2014), GitHub's press page mentions a total of 6 million users. The text file containing these user names is 81 MB in size, with one user name per line.
Perhaps in the future it would be interesting to see a study of the demographics, gender variation and other details of these user accounts. In my context, this is part of a triplecheck goal: open source is open, but it is not transparent enough, and that is what we are working to improve.
Following the spirit of open source and free software, the software used to generate this index of users is also available on GitHub. It is released as freeware under modified EUPL (European Union Public Licence) terms. You can find the code and compilation instructions at https://github.com/triplecheck/gitfinder
This code does more than just index users; I'll write more about what it does in future posts. In the meantime you might be asking yourself: where can I get this text file?
For the moment I haven't made the text file available at a definitive location. I will likely talk with the good people at FLOSSmole to see if they would be interested in hosting it and keeping it easily reachable for other folks.
Forward we move.