How big was triplecheck data in 2015?

Last year was the first time we released an offline edition of our tooling for evaluating software originality. At the beginning this tool shipped on a single one-terabyte USB drive. However, sending the drive by normal post was difficult, and the speed at which we could read data from the disk peaked at 90Mbps. In turn, this made scans take way too long (defined as anything running longer than 24 hours).

As the year moved along, we kept reducing the disk space required for fingerprints while increasing the total number of fingerprints shipped with each new edition.

In the end we managed to fit the basic originality data sets inside a special 240Gb USB thumb drive. By "special" I mean a drive containing two miniature SSD devices connected in hardware-based RAID 0 mode. For those unfamiliar with RAID, this means two disks working together and appearing on the surface as a single disk. The added advantage is faster reads: you are physically reading from two disks at once, which roughly doubles the speed. Since the drive also has no moving parts, the speed of the whole thing jumped to 300Mbps. My impression is that we haven't yet reached the peak speed at which data can be read from the device; our bottleneck simply moved to the CPU cores and software not being able to digest data any faster. For contractual reasons I can't mention the thumb drive model, but it is a device in the range of $500 to $900. Certainly worth the price when scans complete faster.

Another multiplier for speed and data size was compression. We ran tests to find a compression algorithm that wouldn't need much CPU to decompress and at the same time would reduce disk space. We settled on plain zip compression since it consumed minimal CPU and gave a good-enough ratio of 5:1, meaning that something that used 5Gb before now used only 1Gb of disk space.
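
To give an idea of what such a test can look like, here is a minimal sketch using Python's zlib module (DEFLATE, the same algorithm used inside zip files). It is not our actual test harness: the `measure` helper and the sample records are made up for illustration, and the ratio you get depends entirely on your own data.

```python
import time
import zlib


def measure(data: bytes, level: int = 6):
    """Return (compression ratio, seconds spent decompressing)."""
    packed = zlib.compress(data, level)
    start = time.perf_counter()
    restored = zlib.decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == data  # the round trip must be lossless
    return len(data) / len(packed), elapsed


# Hypothetical sample: repetitive text records standing in for fingerprint data.
sample = b"".join(
    b"%08d;src/main/File%d.java;9f86d081\n" % (i, i % 50) for i in range(20000)
)

ratio, seconds = measure(sample)
print(f"ratio {ratio:.1f}:1, decompressed in {seconds * 1000:.2f} ms")
```

Running the same measurement with different algorithms and levels over a representative slice of the real data is enough to compare candidates on both ratio and decompression cost.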

There is an added advantage to this technique besides disk space: we were now able to read the same data almost 5x faster than before. Where before we needed to read 5Gb from the disk, now we only needed to read 1Gb to access the same data (discounting CPU load). It then became possible to fit 1Tb of data inside a 240Gb drive, reducing the needed disk space by 4x while increasing speed by 3x with the same data.
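
The arithmetic behind this can be written down in a few lines. This is only a back-of-the-envelope sketch: the function names and the optional `cpu_limit` cap are assumptions for illustration, and real numbers depend on how fast the CPU can decompress.

```python
def disk_needed(raw_size, ratio):
    """Disk space used after compression, in the same unit as raw_size."""
    return raw_size / ratio


def effective_read_speed(disk_speed, ratio, cpu_limit=None):
    """Speed at which uncompressed data becomes available.

    Every unit read from disk expands into `ratio` units of data, so the
    best case is disk_speed * ratio. If the CPU can only decompress up to
    `cpu_limit` (hypothetical parameter), that becomes the ceiling.
    """
    speed = disk_speed * ratio
    return speed if cpu_limit is None else min(speed, cpu_limit)


# 5Gb of data at a 5:1 ratio needs 1Gb on disk.
print(disk_needed(5, 5))

# Best case for a 5:1 ratio: the same data arrives 5x faster.
print(effective_read_speed(300, 5) / 300)

# With a CPU that cannot keep up, the gain is smaller than the ratio.
print(effective_read_speed(300, 5, cpu_limit=900) / 300)
```

The last line is the situation described earlier: once the disk is fast enough, the bottleneck moves to the CPU and the software.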

All this leads to the question: how big was triplecheck data last year?

These are the raw numbers:
     source files: 519,276,706
     artwork files: 157,988,763
     other files: 326,038,826
     total files: 1,003,304,295
     snippet files: 149,843,377
     snippets: 774,544,948
         jsp: 892,761
         cpp: 161,198,956
         ctp: 19,708
         ino: 41,808
         c: 54,797,323
         nxc: 324
         hh: 20,261
         tcc: 27,974
         j: 2,190
         hxx: 446,002
         rpy: 2,457
         cu: 17,757
         inl: 337,850
         cs: 26,457,501
         jav: 1,780
         cxx: 548,553
         py: 189,340,451
         php: 229,098,401
         java: 94,896,020
         hpp: 6,481,794
         cc: 9,915,077
     snippet size real: 255 Gb
     snippet size compressed: 48 Gb
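
For anyone who wants to check the figures, the short script below copies the counts from the list above, confirms that the parts add up to the stated totals, and prints the share of each category. Nothing here is new data; only the variable names are mine.

```python
files = {
    "source": 519_276_706,
    "artwork": 157_988_763,
    "other": 326_038_826,
}
total_files = 1_003_304_295

snippets = {
    "jsp": 892_761, "cpp": 161_198_956, "ctp": 19_708, "ino": 41_808,
    "c": 54_797_323, "nxc": 324, "hh": 20_261, "tcc": 27_974,
    "j": 2_190, "hxx": 446_002, "rpy": 2_457, "cu": 17_757,
    "inl": 337_850, "cs": 26_457_501, "jav": 1_780, "cxx": 548_553,
    "py": 189_340_451, "php": 229_098_401, "java": 94_896_020,
    "hpp": 6_481_794, "cc": 9_915_077,
}
total_snippets = 774_544_948

# The per-category counts must add up to the published totals.
assert sum(files.values()) == total_files
assert sum(snippets.values()) == total_snippets

for kind, count in files.items():
    print(f"{kind:8s} {count / total_files:6.1%}")

# Extensions with the most snippets, largest first.
top = sorted(snippets.items(), key=lambda kv: kv[1], reverse=True)[:4]
for ext, count in top:
    print(f"{ext:8s} {count / total_snippets:6.1%}")

# Compression ratio of the snippet data (255 Gb real, 48 Gb compressed).
snippet_ratio = 255 / 48
print(f"snippet compression ratio {snippet_ratio:.1f}:1")
```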

One billion individual fingerprints for binary files were included. About 500 million (50%) of these fingerprints belong to source code files in 54 different programming languages. Around 15% are related to artwork, meaning icons and png or jpg files. The remaining files are those usually included with software projects, such as .txt documents.

Over the year we kept adding snippet detection capabilities for mainstream programming languages. This means the majority of C-based dialects, Java, Python and PHP. On the portable offline edition we were unable to include the full C collection: it was simply too big, and there wasn't much demand from customers to have it included (with only one notable customer exception across the year). In terms of qualified individual snippets, we are tracking a total of more than 700 million across 150 million source code files. A qualified snippet is one that contains enough valid logical instructions. We use a metric called "diversity", meaning that a snippet is only accepted when a given percentage of its content is logical commands. For example, a long switch or IF statement without other relevant code is simply ignored, because it is not typically relevant from an originality point of view.
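
To make the idea of a diversity filter more concrete, here is a toy version. It is not the metric we actually run: the keyword list, the line-based counting and the 0.5 threshold are all assumptions chosen only to show the principle of rejecting snippets that are mostly branching scaffolding.

```python
import re

# Hypothetical list of keywords treated as pure branching scaffolding.
BRANCHING = re.compile(r"^(if|else|switch|case|default|break)\b")


def diversity(snippet: str) -> float:
    """Fraction of non-empty lines that are not branching scaffolding."""
    lines = [ln.strip() for ln in snippet.splitlines()]
    # Ignore blank lines and lines holding only a brace.
    lines = [ln for ln in lines if ln and ln not in ("{", "}")]
    if not lines:
        return 0.0
    # Strip a leading "} " so that "} else {" counts as branching too.
    logical = [ln for ln in lines if not BRANCHING.match(ln.lstrip("} "))]
    return len(logical) / len(lines)


def is_qualified(snippet: str, threshold: float = 0.5) -> bool:
    """Accept a snippet only when enough of it is real logic."""
    return diversity(snippet) >= threshold


switch_only = """
switch (x) {
case 1:
    break;
case 2:
    break;
default:
    break;
}
"""

weighted_sum = """
int total = 0;
for (int i = 0; i < n; i++) {
    total += values[i] * weights[i];
}
if (total > limit) {
    total = limit;
}
return total;
"""

print(is_qualified(switch_only))
print(is_qualified(weighted_sum))
```

A real implementation would work on parsed tokens rather than on raw lines, but the accept-or-ignore decision follows the same shape.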

The body of data was built from relevant source code repositories available to the public and a selection of websites such as fora, mailing lists and social networks. We are picky about which files to include in the offline edition and only accept around 300 specific types of files. The raw data collected during 2015 went above 3 trillion binary files, and much effort went into being able to iterate over this archive within weeks instead of months to build relevant fingerprint indexes.

For 2016 the challenge continues. There is a data explosion ongoing. We noticed 200% growth between 2014 and 2015, although this might be because our own techniques for gathering data have improved and we are no longer limited by disk space as we were when we first started in 2014. More interesting is remembering that the NIST fingerprint index had a relevant compendium of 20 million fingerprints in 2011, and that we now need technology to handle 50x as much data.

So let's see. This year I think we'll be using the newer models with 512Gb. A big question mark is whether we can somehow squeeze out more performance by using the built-in GPU found on modern computers. This is new territory in our context, though, and there is no certainty that moving data between disk, CPU and GPU will bring added performance or be worth the investment. The computation is already light as it is, and not particularly suited (IMHO) to GPU-style processing.

The other field to explore is image recognition. We have one of the biggest archives of miniature artwork (icons and such) that you would find applied in software. There are cases where the same icon is saved under different formats, and right now we are not detecting such cases. The second doubt is whether we should pursue this kind of detection at all, that is, whether it is really necessary (although there is no doubt it is a cool thing). What I'm sure of is that we have already doubled the archive compared to last year and that soon we'll be creating new fingerprint indexes. Once again the optimization starts, to keep speed acceptable. Oh well, data everywhere. :-)