TripleCheck GmbH

The news is official: TripleCheck is now a German company in its own right.

It had been a registered company since 2013, but it was labelled a "hobby" company because of the unfortunate UG (haftungsbeschränkt) tag applied to young companies in Germany.

You might be wondering at this point: "What is all the fuss about?"

In Germany, a normal company needs to be founded with a minimum of 25,000 EUR in the company bank account. When you don't have that much money, you can start a company with as little as one Euro, or as much as you can, but you get labelled a UG rather than a GmbH.

When we first started the company, I couldn't have cared less about the UG vs. GmbH discussion. It was only when we started interacting with potential customers in Germany that we understood the problem. Because of the UG label, your company is seen as unreliable. It soon became frequent to hear: "I'm not sure you will still be around in 12 months", and this came with other implications, such as banks refusing to grant us a credit card linked to the company account.

Some people joke that UG stands for "Untergrund", in the sense that this type of company has strong odds of sinking underground within a few months. Sadly true.

You see, in theory a company can save money in the bank account until it reaches 25k EUR and then upgrade. In practice, we make money but at the same time have servers, salaries and other heavy costs to pay. Pushing the balance to 25k is quite a pain. Regardless of how many thousands of Euros are made and then spent across the year, that yearly flow of revenue does not count unless you can show the bank account above the magic 25k at one moment in time.

This month we finally broke that limitation.

No more excuses. We went well above that threshold and upgraded the company to a full GmbH. Ironically, this only happened once our team had temporarily moved out of Germany and was opening a new office elsewhere in Europe.

Finally a GmbH. I have to say that Germany is in some respects very unfriendly to startups. The UG situation reduces a young startup's chances of competing at the same level as a GmbH, even when the UG company's technology is notably more advanced.

For example, in the United Kingdom you can register a Ltd. company and stand at the same level as the large majority of companies. Ironically, we could have registered a Ltd. in the UK without money in the bank and then used that status in Germany to look "better" than a plain boring GmbH.

The second thing that bothers me is taxes. As a UG we pay the same level of taxes as a full GmbH, whereas in the UK you get tax breaks when starting an innovative company. In fact, there you get back 30% of your developer expenses (anything considered R&D). From Germany we only got heavy tax bills to pay every month.

I'm happy that we are based in Germany. As you can see, we struggled to survive and move up to GmbH. We carved our place in Darmstadt against the odds. But Germany, you are really losing your competitive edge when there are so many advantages for Europeans to open startups in the UK rather than in DE. Let's try to improve that, shall we? :-)




Intuitive design for command line switches

I use the command line.

It is easy and gets things done in a straightforward manner. However, the design of command line switches is an everyday problem. Even for the tooling used most often, one never fully memorizes the switches required for common day-to-day actions.

https://xkcd.com/1168/


For example, want to rsync something?
rsync -avvzP user@location1 ./

Want to decompress some .tar file?
tar -xvf something.tar

The above is already not friendly, but still more or less doable with practice. Now ask yourself (attention: no googling allowed):

How do you find files with a given extension inside a folder and sub-folders?

You see, in Unix (Linux, Mac, etc.) this is NOT easy. Like so many other commands, a very common task was not designed with intuitive usage in mind. They work, but only insofar as you learn an encyclopedia of switches. Sure, there are manual pages and google/stackoverflow to help, but what happened to simplicity in design?

In Windows/ReactOS one would type:
dir *.txt /s

In Unix/Linux this is a top answer:
find ./ -type f -name "*.txt"

source: http://stackoverflow.com/a/5927391

Great. This kind of complication for everyday tasks is everywhere. Want to install something? You need to type:
apt-get install something

Since we only use apt-get to install stuff, why not simply:
apt-get something

When designing your next command line app, it would help end-users if you listed which usage scenarios will be most popular and reduced the switches to the bare minimum. Some people argue that switches deliver consistency, and that is a fact. However, one should perhaps balance consistency with friendliness, which in the end turns end-users into happy users.
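As a rough illustration of that idea, here is a hedged sketch of a command line entry point in Java where the bare argument performs the most common action and switches stay optional (the tool name "fetchtool" and the "--verbose" switch are invented for the example):

public class FetchTool {

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: fetchtool <target> [--verbose]");
            return;
        }
        String target = args[0];   // "fetchtool something" just works, no switch required
        boolean verbose = args.length > 1 && args[1].equals("--verbose");
        if (verbose) {
            System.out.println("Fetching " + target + " ...");
        }
        // ... the actual work would go here
    }
}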

Be nice, keep it simple.

¯\_(ツ)_/¯

Question:
What is the weirdest Unix command that really upsets you?


Noteworthy reactions to this post:

- Defending Unix against simpler commands:
http://leancrew.com/all-this/2016/03/in-defense-of-unix
- This post ended up stirring a Linux vs. Windows fight:
https://news.ycombinator.com/item?id=11229025




How big was triplecheck data in 2015?

Last year was the first time we released the offline edition of our tooling for evaluating software originality. At the beginning this tool shipped on a one-terabyte USB drive. However, shipping the USB by normal post was difficult, and the I/O speed at which we could read data from the disk peaked at 90 MB/s, which in turn made scans take far too long (defined as anything longer than 24 hours of running).

As the year moved along, we kept reducing the disk space required for the fingerprints while increasing the total number of fingerprints shipped with each new edition.

In the end we managed to fit the basic originality data sets inside a special 240 GB USB thumb drive. By "special" I mean a drive containing two miniature SSDs connected in hardware RAID 0 mode. For those unfamiliar with RAID, this means two disks working together while appearing on the surface as a single disk. The advantage is faster reads, because you are physically reading from two disks and roughly doubling the speed. Since it has no moving parts, the speed of the whole thing jumped to 300 MB/s. My impression is that we haven't yet reached the peak speed at which data can be read from the device; our bottleneck simply moved to the CPU cores/software not being able to digest data faster. For contract reasons I can't mention the thumb drive model, but it is a device in the range of $500 to $900. Certainly worth the price when scanning completes faster.

Another multiplier for speed and data size was compression. We ran tests to find a compression algorithm that wouldn't need much CPU to decompress and at the same time would reduce disk space. We settled on plain zip compression since it consumed minimal CPU and yielded a good-enough ratio of 5:1, meaning that something using 5 GB before now used only 1 GB of disk space.

There is an added advantage to this technique besides disk space: we could now read the same data almost 5x faster than before. Where we previously needed to read 5 GB from disk, the requirement dropped to 1 GB for the same data (discounting CPU load). It then became possible to fit 1 TB of data inside a 240 GB drive, reducing the needed disk space by roughly 4x while increasing speed by 3x for the same data.
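For illustration only, here is a minimal Java sketch of reading lines straight out of a zip entry, so decompression happens on the fly while the disk only delivers the compressed bytes (the file name "fingerprints.zip" is a placeholder, not our actual data layout):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZippedReaderSketch {
    public static void main(String[] args) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new FileInputStream("fingerprints.zip"))) {
            ZipEntry entry = zip.getNextEntry();            // position at the first entry
            if (entry != null) {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    // process one record; only ~1/5 of the bytes crossed the disk at a 5:1 ratio
                }
            }
        }
    }
}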

All this brings us to the question: how big was triplecheck data last year?

These are the raw numbers:
     source files: 519,276,706
     artwork files: 157,988,763
     other files: 326,038,826
     total files: 1,003,304,295
     snippet files: 149,843,377
     snippets: 774,544,948
         jsp: 892,761
         cpp: 161,198,956
         ctp: 19,708
         ino: 41,808
         c: 54,797,323
         nxc: 324
         hh: 20,261
         tcc: 27,974
         j: 2,190
         hxx: 446,002
         rpy: 2,457
         cu: 17,757
         inl: 337,850
         cs: 26,457,501
         jav: 1,780
         cxx: 548,553
         py: 189,340,451
         php: 229,098,401
         java: 94,896,020
         hpp: 6,481,794
         cc: 9,915,077
     snippet size real: 255 GB
     snippet size compressed: 48 GB

One billion individual fingerprints for binary files were included. 500 million (50%) of these fingerprints are source code files in 54 different programming languages. Around 15% of the fingerprints relate to artwork, meaning icons, png and jpg files. The other files are the kind usually included with software projects, things like .txt documents and such.

Over the year we kept adding snippet detection capabilities for mainstream programming languages, meaning the majority of C-based dialects, Java, Python and PHP. On the portable offline edition we were unable to include the full C collection; it was simply too big and there wasn't much customer demand to have it included (with one notable customer exception across the year). In terms of qualified individual snippets, we are tracking a total of 700 million across 150 million source code files. A qualified snippet is one that contains enough valid logical instructions. We use a metric called "diversity", meaning that a snippet is only accepted when a given percentage of its content consists of logical commands. For example, a long switch or IF statement without other relevant code is simply ignored because it is typically not relevant from an originality point of view.
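The exact rules are internal to our tooling, but a back-of-the-envelope sketch of such a check could look like the following Java snippet, where the keyword list and thresholds are made up for illustration: count how many lines carry logical instructions and how many different kinds appear, so a long bare switch or IF block does not qualify on its own.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DiversitySketch {

    // illustrative keywords and thresholds, not the real triplecheck rules
    private static final List<String> LOGIC_KEYWORDS =
            Arrays.asList("if", "else", "for", "while", "return", "switch", "case", "=");

    public static boolean isQualified(String[] snippetLines) {
        if (snippetLines.length == 0) {
            return false;
        }
        int logicalLines = 0;
        Set<String> kindsSeen = new HashSet<>();
        for (String line : snippetLines) {
            for (String keyword : LOGIC_KEYWORDS) {
                if (line.contains(keyword)) {
                    logicalLines++;                 // count each line at most once
                    kindsSeen.add(keyword);
                    break;
                }
            }
        }
        double share = (double) logicalLines / snippetLines.length;
        return share >= 0.30 && kindsSeen.size() >= 3;
    }
}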

The body of data was built from relevant source code repositories available to the public and a selection of websites such as forums, mailing lists and social networks. We are picky about which files to include in the offline edition and only accept around 300 specific file types. The raw data collected during 2015 exceeded 3 trillion binary files, and much effort went into iterating over this archive within weeks instead of months to build the relevant fingerprint indexes.

For 2016 the challenge continues. There is a data explosion ongoing. We noticed 200% growth between 2014 and 2015, albeit this might be because our own data-gathering techniques improved and are no longer limited by disk space as when we first started in 2014. It is also interesting to remember that the NIST fingerprint index held a relevant compendium of 20 million fingerprints in 2011, and that we now need technology to handle 50x as much data.

So let's see. This year I think we'll be using the newer 512 GB models. A big question mark is whether we can squeeze out more performance by using the built-in GPU found on modern computers. This is new territory for our context, though, and there is no certainty that moving data between disk, CPU and GPU will bring added performance or be worth the investment. The computation is already light as it is, and not particularly suited (IMHO) to GPU-style processing.

The other field to explore is image recognition. We have one of the biggest archives of miniature artwork (icons and such) found in software. There are cases where the same icon is saved in different formats, and right now we are not detecting them. The second doubt is whether we should pursue this kind of detection at all, because it might not be necessary (though I have no doubt it is a cool thing). What I am sure of is that we have already doubled the archive compared to last year and that soon we'll be creating new fingerprint indexes. Then the optimization to keep speed acceptable starts again. Oh well, data everywhere. :-)





2016

The last twelve months did not pass quickly. It was a long year.

Family-wise, things changed. Some were affected by Alzheimer's, others show old age all too early. My own mom had surgery for two different cancers plus a foot surgery. My 80-year-old grandma broke a leg, which is a problem at her age. In the worst cases, family members passed away. Too much, too often, too quickly. I've tried to be present, to support the treatment expenses and somehow, just somehow, help. The saddest event was the death of my wife's father. It happened overnight, just before Christmas, far too quick and unexpected. Also sad was the earlier death of our pet dog, who had been part of the family for a whopping 17 years. It was sad to see our old dog put to his final sleep. He was in constant pain and couldn't even walk any more. I will miss our daily walks in the park, which happened three times a day regardless of snow or summer. We'd just get out on the street for fresh air so he could do his business. I enjoyed the sun outside the office many times thanks to him, and remain grateful for those good moments.

A great moment in 2015 was the birth of my second son, a strong and healthy boy. It brought back the happy memory of my first child being born in 2008. Almost everything has changed since that year, especially maturity-wise. In 2009 I made the world familiar to me fall apart, and to this day I feel sad about the decisions that eventually broke my first marriage. I can't change the past, but I can learn, work and aim to become a better father for my children. This is what I mean by maturity: going the extra mile to balance family and professional activities. It is sometimes crazy, but somehow there must be balance.

This year we had the first proper family vacation in five years, two weeks at a mountain lake. No phone, no Internet. I had to walk a kilometre on foot to get some WiFi on the phone at night. That summer we were talking with investors and communication was crucial, so the idea of vacations seemed crazy. In the end, family was given preference, and after the summer we didn't go forward with the investors anyway. Quality time with family was what really mattered; lesson learned.

Tech-wise we did the impossible, repeatedly. If by December 2014 we had an archive with a trillion binary files and struggled hard with how to handle the data already gathered, by the end of 2015 we estimated that we had 3x as much data stored. Not only did the availability of open data grow exponentially, we also kept adding new sources of data before they would vanish. If before we targeted some 30 types of source code files related to mainstream programming languages, we now target around 400 different types of binary formats. In fact, we don't even target just files: these days we extract relevant data from blogs, forum sites and mailing lists. I mention an estimate because only in February will we likely be able to pause and compute rigorous metrics. There was an informal challenge at DARPA to count the number of source code lines publicly available to humanity at present; we might be able to report back a 10^6 growth compared to an older census.

Having many files and handling that much data with very limited resources is one part of the equation that we (fortunately) had already solved back in 2014. The main challenge for 2015 was how to find the needles of relevant information inside a large haystack of public source code within a reasonable time. Even harder, how to enable end-users (customers) to find these needles by themselves, inside the haystack, from their laptops and in an offline manner, without a server farm somewhere (privacy). However, we did manage to get the whole thing working. Fast. The critical test was a customer with over 10 million LOC in different languages, written over the past 15 years. We were in doubt about such a large code base, but running the triplecheck tooling from a normal i7 laptop to crunch the matches required only 4 days, compared to 11 days for other tools with smaller databases. That was a few months ago; in 2016 we are aiming to reduce this to a single day of processing (or less). Impossible is only impossible until someone makes it possible. Don't listen to naysayers; just take as many steps as you need to go up a mountain, no matter how big it might be.


Business-wise it was quite a ride. The worst decisions (mea culpa) in 2015 were pitching our company at a Europe-wide venture capital event and trusting completely in outsourced sales, without preparation for either. The first decision wasn't initially bad. We went there and got 7 investors interested in follow-up meetings, and were very honest about where the money would be used and the expected growth. However, the cliché that engineers are not good at business might be accurate. Investors speak a different language, and there was disappointment on both sides. This initiative cost us thousands of euros in travel and material costs, along with 4 months of stalled development. Worse was believing that the outsourced team could deliver sales (without asking for proof or a test beforehand). Investors can invest without proof of revenue, but when someone goes to market they want to wait and see how it performs. In our case, it didn't perform. Many months later we had paid thousands of EUR to the outsourced company and had zero product revenue to show for it. I felt like a complete fool for permitting this to happen and not putting on the brakes earlier.

The only thing saving the company at this point was our business angel. Thanks to his support we kept getting new clients for the consulting activities. The majority of these clients became recurring M&A customers, and this is what kept the company afloat. I can never thank him enough, a true business angel in the literal sense of the expression. By October, the dust from outsourcing and investor hunting was gone. There was now certainty that we want to build a business and not a speculative startup.

We finally got product sales moving forward by bringing aboard a veteran of this kind of challenge. For a start, no more giving our tools away for free during the trial phase. I was skeptical, but it worked well because it filtered our attention to companies that would pay upfront for a pilot test. This made customers take the trial phase seriously, since it had a real cost paid by them. The second change was to stop using PowerPoint during meetings. I still prepare slides before customer meetings, but presenting them is counter-productive. More often than not, customers couldn't care less about what we do. Surprisingly enough, they care about what they do and how to get their own problems solved. :-) Today the focus is on listening more than speaking at such meetings. Those two simple changes made quite a difference.


So, that's the recap from last year. Forward we move. :-)

Linux Mint 17 with Windows 10 look

This weekend I finally took the time to upgrade Windows 7 on my old laptop and try out that system tray button offering the free Windows 10 install.

I was surprised: this is an old laptop from 2009 that came with the stock Windows 7 version, and it still worked fairly well. I have to say the new interface does look better and simpler. The desktop is enjoyable, but the fact that this Windows version beams whatever I'm doing on my own laptop up to Microsoft is still a bother and sends a cold shiver down the spine.

On my newer laptop I run Linux Mint. It was an old version installed back in 2013 and could really use an update. Since it was upgrade weekend, I decided to go ahead, bring this Linux machine up to a more recent version of Mint and see what had changed over the past years. While doing this upgrade, a question popped up: "How about the Windows 10 design with Linux underneath, would it work?"

And this is the result:
http://3.bp.blogspot.com/-VnAgpP-gck8/ViPwqC4VyqI/AAAAAAAANvI/sjDRrGqBLvg/s1600/Screenshot_2015-10-18_16-59-49.png


The intention wasn't to create a perfect look-alike, but (in my opinion) to mix things up and get a relatively fresh-looking design based on Windows, without giving up our privacy.


Operating System

I downloaded Linux Mint 17.2 (codename Rafaela, Cinnamon edition for x64) from http://www.linuxmint...tion.php?id=197

Instead of installing to disk, this time I installed and now run the operating system from a MicroSD card connected to the laptop through the SD reader using an SD adapter. The MicroSD is a Samsung 64 GB with an advertised read speed of 40 MB/s. It cost around 30 EUR.

Installing the operating system followed the routine steps one would expect. There is a GUI tool within Linux Mint to write the DVD ISO onto a pen drive connected to your laptop. Then boot from the USB and install the operating system onto the MicroSD, with the boot entry added automatically.


Windows 10 theme and icons

Now that the new operating system is running, we can start the customization.

The window style you see in the screenshot can be downloaded from: http://gnome-look.or...?content=171327

This theme comes with icons that look exactly like Windows 10, but that didn't look balanced, nor was the intention to copy the icons pixel by pixel. Rather, the intention was to reuse the design guidelines. While looking for options, I found Sigma Metro, which resembled what was needed: http://gnome-look.or...?content=167327

If you look around the web, you'll find instructions on how to change window themes and icons. If you run into difficulties, just write me a message and I'll help.


Firefox update and customization

Install Ubuntu Tweak. From there, go to the Apps tab and install the most recent edition of Firefox, because the one included in the distro is a bit old.

Start changing Firefox by opening it up and going to "Add-ons" -> "Get Add-ons". Type "Simple White Compact" into the search box; this was the simplest theme I found and it changes the browser's looks, from icons to tab position, as you can see in the screenshot. Other extensions you might enjoy adding while making these changes are "Adblock Plus" to remove ads, "Tab Scope" to show thumbnails when browsing tabs and "YouTube ALL HTML5" to force YouTube to run without the Adobe Flash Player.


Office alternative and customization

Then we arrive at Office. I only keep that oldish laptop because it has Adobe Reader (which I use for signing PDF documents) and Microsoft Office for the cases when I need to modify documents and presentations without them ending up looking broken. I was prepared this time to run both apps using Wine (it is possible), but decided to first update the alternatives and try using only Linux-native apps. I was pleasantly surprised.

LibreOffice 4.x is included by default in the distro. Whenever I used it, my slides formatted in MS Office would look broken and unusable. I decided to download and try out version 5.x and, to my surprise, noticed that these issues are gone. Both slides and Word documents are now properly displayed with just about the same results I'd expect from Microsoft Office. I'm happy.

To install LibreOffice 5.x visit https://www.libreoff...reoffice-fresh/

For the Linux edition, read the text document with instructions. It is quite straightforward, just one command line to launch the setup. So I was happy with LibreOffice as a complete replacement for Microsoft Office (no need to acquire licenses or run Office through Wine). However, the icons inside LibreOffice still didn't look good; they looked old. In this respect the most recent version of Microsoft Office simply "looks" better, and I wanted LibreOffice to look that way too. So I got icons from here: http://gnome-look.or...?content=167958

It wasn't straightforward to find out where the icons should be placed, because the instructions for version 4.x no longer apply. To save you the search, the zip file with the icons needs to be placed inside:
/opt/libreoffice5.0/share/config/

Then you can open Writer and, under "Tools" -> "Options" -> "View", choose "Office2013" to get the new icons in use. The LibreOffice startup logo also seemed too flashy and could be changed, so I replaced it with the one available at http://gnome-look.or...?content=166590

Just a matter of overwriting the intro.png image found at:
/opt/libreoffice5.0/program


Alternative to Adobe Reader for signing PDF

Every now and then a PDF comes along that needs to be printed, signed by pen and then scanned to send back to the other person. I stopped doing this kind of thing some time ago by adding a digital signature that includes an image of my handwritten signature on the document. This way there's no need to print or scan any papers. Adobe Reader did a good job at this task, but getting it to run on Wine with the signature function working was not straightforward.

I started looking for a native Linux alternative and found "Master PDF Editor". The code for this software is not public, but I couldn't find other options; it was the only one providing a native Linux install that supports digital handwritten signatures: https://code-industr...asterpdfeditor/

If you're using this tool for business, you need to acquire a license; home use is free of cost. Head over to the download page and install the app. I was surprised because it looked very modern, simple and customizable. I'll buy a license for this tool, it does exactly what I needed. With LibreOffice and Master PDF Editor as complete alternatives to MS Office and Acrobat, there is no longer a valid reason (in my case) to switch back to the old laptop for editing documents. This can now be done with the same (or even better) quality from Linux.


Command line

A relevant part of my day-to-day involves the command line. In Linux this is a relatively pleasant task because the terminal window can be adjusted and customized, and never feels like a second-class citizen on the desktop. With these recent changes applied, it was now possible to improve the terminal window further by showing the tool bar (see the screenshot).

Open a terminal and click "View" -> "Show tool bar". Usually I'm against adding buttons, but that tool bar has a button for pasting clipboard text directly into the console. I know this can be done from the keyboard using "Ctrl + Shift + V", but I found it very practical to just click a single button and paste the text.


Non-Windows tweaks

There are tweaks only possible on Linux. One of my favorites is still "wobbly windows". Enable Compiz on the default desktop environment: http://askubuntu.com...-wobbly-windows

With Compiz many tweaks are possible. I've kept them to a minimum, but it certainly is refreshing to have some animations rather than plain window frames. If you have never seen this in action, here is a video example: https://www.youtube.com/watch?v=jDDqsdrb4MU


Skype alternatives

Many of my friends and business contacts use Skype. It is not safe, it is not private, and I'd prefer a non-Microsoft service because the Skype client gets installed on my desktop. Who knows what it can do on my machine while running in the background. One interesting alternative I found was launching the web edition of Skype at https://web.skype.com/

In Firefox, there is the option to "pin" a given tab. So I pinned Skype, as you can see in the screenshot, and it now opens automatically whenever the browser is opened, in practice bringing it online when I want to be reachable. A safe desktop client would be a better alternative; this is nowhere near a perfect solution, but rather a compromise that avoids installing the Skype client.


Finishing

There are more small tweaks to adjust the desktop to my own case, but what is described above covers the big blocks to help you reach this kind of design if you'd like to do something similar. If you have any questions or get stuck at any part of the customization, just let me know.

Have fun!
:-)

.ABOUT format to document third-party software

If you are a software developer, you know that every now and then someone asks you to create a list of the third-party things that you are using on some project.

This is a boring task. Ask any single, motivated developer and try to find one who will not roll his eyes when asked to do this kind of thing. We (engineers) don't like it, yet we are doomed to get this question every now and then. It is not productive to repeat the same thing over and over again; why can't someone make it simpler?

Waiting a couple of years didn't work, so it was time to roll up the sleeves and find an easier way of getting this sorted. To date, one needs to manually list each and every portion of code that is not original (e.g. libraries, icons, translations, etc.) and this ends up either in a text file or a spreadsheet (pick your poison).

There are ways to manage dependencies; think of npm, maven and similar. However, you need to be using a dependency manager, and this doesn't cover non-code items. For example, when you want to list that package of icons from someone else, or list dependencies that are part of the project but not really part of the source code (e.g. servers, firewalls, etc.).

For these cases, you still need to do things manually and it is painful. At TripleCheck we don't like doing these lists ourselves, so we started looking into how to automate this step once and for all. Our requirements: 1) simple, 2) tool-agnostic and 3) portable.

So we leaned towards the way configuration files work, because they are plain text files that are easy for humans to read or edit, and straightforward for machines to parse. We are big fans of SPDX because it permits describing third-party items in intricate detail, but a drawback of being so detailed is that sometimes we only have coarse information. For example, we know that the files in a given folder belong to some person and have a specific license (maybe we even know the version), but we don't want to compute the SHA1 signature for each and every file in that folder (either because the files change often, or simply because it won't be done easily and quickly by the engineer).

It turns out we were not alone in this kind of quest. NexB had already pioneered, in previous years, a text format specifically for this kind of task, defining the ".ABOUT" file extension to describe third-party copyrights and applicable licenses: http://www.aboutcode.org/


The text format is fairly simple, here is an example we use ourselves:
 
name: jsTree
license_spdx: MIT
copyright: Ivan Bozhanov
version: 3.0.9

spec_version: 1.0
download_url: none
home_url: http://jstree.com/

# when was this ABOUT file created or last updated?
date: 2015-09-14

# files inside this folder and sub-folders
about_resource: ./

Basically, it follows the SPDX license abbreviations to ensure we use a common way of talking about the same license, and you can add or omit information depending on what is available. Pay attention to the "about_resource" field, which describes what is covered by this ABOUT file. Using "./" means all files in the folder and in the respective sub-folders.

One interesting point is the possibility of nesting multiple ABOUT files. For example, place one ABOUT file at the root of your project to describe the license terms generally applicable to the project, and then create specific ABOUT files for specific third-party libraries/items to describe what applies in those cases.

When done with the text file, place it in the same folder as what you want to cover. The "about_resource" field can also be used for a single file, or repeated over several lines to cover a very specific set of files.
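Because the format is just key: value lines plus # comments, parsing it takes only a few lines of code. Here is a minimal, hedged sketch in Java (not the NexB tooling; the file name is hypothetical):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class AboutFileSketch {

    // reads an .ABOUT file into key/value pairs, skipping comments and blank lines
    public static Map<String, String> parse(String path) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int separator = trimmed.indexOf(':');
            if (separator > 0) {
                fields.put(trimmed.substring(0, separator).trim(),
                           trimmed.substring(separator + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> about = parse("jstree.ABOUT");   // hypothetical file name
        System.out.println(about.get("license_spdx"));       // prints "MIT" for the example above
    }
}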

NexB has made tooling available to collect ABOUT files and generate documentation. Unfortunately, this text format is not as well known as it should be. Still, it fits like a glove as an easy solution to list third-party software, so we started using it to automate the code detection.

Our own TripleCheck engine now supports the recognition of .ABOUT files and automatically adds this information to the generated reports. There is even a simple web frontend for creating .ABOUT files at http://triplecheck.net/components/

From that page you can either create your own .ABOUT files or simply browse the collection of already created files. The backend of that web page is powered by GitHub; you can find the repository at https://github.com/dot-about/components/tree/master/samples


So, no more excuses to keep listing third-party software manually on spreadsheets.

Have fun! :-)









Something is cooking in Portugal

I don't usually write about politics; for me it is more often a never-ending discussion about tastes rather than facts.

However, one senses a disturbance in the force in Portugal. For the first time in the last (35?) years we see a change in the landscape. For those unfamiliar with Portuguese politics, the country has historically been ruled by one of two large parties. Basically, one "misbehaves" and then the other comes in to "repair"; vice-versa at the next election, as voters grow apathetic and disconnected from whoever gets elected.

This year that wasn't the case. The ruling party is seen as "misbehaving" and the other party didn't get a majority; in other words, it didn't convince a significant part of the population to vote for it. This isn't unusual; what was different was the large number of votes going to two other minor parties, and the fact that most citizens got up from their sofas to vote on who "rules" them for the next few years.

For the first time, I'm watching the second largest party being forced to negotiate with these smaller parties to reach an agreement. For the first time in a long while, they have to review what was promised during the election and be audited by other parties to ensure they keep those promises.

In other words, for the first time I am watching what I'd describe as a realistic democratic process happening in our corner of Europe. These might seem strong words, but the fact is that governing by majority (in our context) is carte blanche to rule over the public interest. Go to Portugal and ask whether people feel the government works on their behalf or against them. Ask them for specific examples from recent years to support their claim: they quickly remember epic fights to stop the government from building expensive airports (Ota), or the extensive (and expensive) network of highways built with EU money that stands empty today, still serving only the private interest of the companies charging tolls on them.

There was (and still is) too high a level of corruption in the higher echelons of government (just look at our former prime minister, recently in jail, or the current prime minister: ask him about "Tecnoforma" or about his friend "Dr. Relvas"), so there is a positive impact when small parties gain greater voting representation, forcing majority administrations to be audited and checked in public.

You see, most of this situation derives from control of mind-share. In previous centuries you'd win support in local towns by promoting your party followers to administrative positions. Later came newspapers (which were tightly controlled), then radio (eventually regulated to forbid rogue broadcasters), then TV (which to date has only two private channels and two state-owned channels), and now comes the Internet.

With the Internet there is a problem: the majority parties in government do not control the platforms where people exchange their thoughts. The Portuguese use Facebook (hate it or like it, that's what ordinary families and friends use among themselves), and Facebook couldn't (currently) care less about elections in Portugal, nor do either of the large parties have the resources to bias Facebook toward their interests. So what we have is a large platform where public news can be debunked as false or plainly biased, where you can see how other citizens really feel about the current state of affairs, and where smaller parties get a fair chance to be read, heard and now even voted for by people who support what they stand for.

For the first time I see the Internet making a real difference in connecting people and enabling the population to collectively learn and change the course of their history, together. As for the Portuguese, you can see the big parties worried that re-election on automatic pilot is no longer assured. They too need to work together now. Portuguese, please do keep voting. For me this is democracy in action. Today I'm happy.


TripleCheck as a Top 20 Frankfurt startup to watch in 2015

Quite an honor and a surprise: we received this distinction despite the fact that we don't see ourselves so much as a startup, but rather as a plain, normal company worried about getting to the next month and growing with its own resources.

Looking back, things are much better today than a year ago. Our schedule is busy at 150% of client allocation, and we managed to survive on plain, normal consulting, finally moving to product sales this year with a good market reception so far. The team grew, we finally have a normal office location, and I still worry each month that the funds in the bank are not enough to cover expenses. Somehow, on that brink between failure and success, we work hard to pay the bills and invest in the material and people that let us move a bit further each month.

It is not easy, this is not your dream story, and we don't know what will happen next year. What I do know is that we are pushed to learn more and grow. That kind of experience has a value of its own.

The next step for triplecheck is to build our own petabyte-level datacenter in Frankfurt in 2015. Cost efficiency aside, we are building a safe house outside of the "clouds", where nobody really knows who has access to them.

I wish it were time for vacations or celebration, but not yet. I'm happy that, together with smart and competent people, we are building a stable company.


List of >230 file extensions in plain JSON format

Over the last year I've collected some 230 file extensions and manually curated their descriptions, so that whenever I find a file extension it becomes possible to give the end-user a rough idea of what the extension is about.


Most of my code nowadays is written in Java but there is interest in porting some of this information to web apps. So I have exported a JSON list that you are welcome to download and use in your projects.

The list is available on GitHub at this link.

One thing to keep in mind is that I'm looking at extensions from a software developer's perspective. This means that when the same extension is used by different programs, I usually favor the ones related to programming.

The second thing is that I collect more information about file extensions than what you find in this JSON list. For example, for each extension I record the applicable programming languages; here is an example for .h source code files. Other values include whether the data is plain binary or text-readable, the category to which the extension belongs (archive, font, image, sourcecode, ...) and other metadata values useful for file filtering and processing.


If you need help or would like to suggest something to improve the list, just let me know.

Updating the header and footer on static web sites using Java

This year was the first time I moved away from websites based on WordPress, PHP and MySQL to embrace the simplicity of static HTML sites.

Simplicity is indeed a good reason. It means virtually no exploits, as there is no database or script interpretation happening. It means speed, since there are no PHP, Java or Ruby scripts running on the server and only plain files are delivered. The last feature I was curious to try is the site hosting provided by GitHub, which only supports static web sites.

The first site to convert was the TripleCheck company site. It had been developed over a year earlier and was overdue for a serious update. It was based on WordPress, and it wasn't easy to make changes to the theme or content. The site was quickly converted and placed online using GitHub.

However, not everything is rosy with static websites. As you can imagine, one of the troubles is updating the text and links that you want to appear on every page of the site. There are tools such as Jekyll that help maintain blogs, but all that was needed here was a simple tool that would pick up the header and footer tags and update them with whatever content was intended.

Easy enough, I wrote a simple app for this purpose. You can download the binaries from this link, and the source code is available at https://github.com/triplecheck/site_update/


How to get started?

Place the site_update.jar file inside the folder where your web pages are located. Then also copy the html-header.txt and html-footer.txt files there and write in them the content you want to use as header and footer.

Inside the HTML pages that you want to change, you need to include the following tags:
<header></header>
<footer></footer>

Once you have this ready, from the command line run the jar file using:
java -jar site_update.jar

Check your HTML pages to see if the changes were applied.


What happens when it is running?

It looks for all files with the .html extension in the same folder where the .jar file is located. For each HTML file it looks for the tags mentioned above and replaces whatever sits between them, effectively updating your pages as needed.

There is an added feature: if you have pages in a sub-folder, the software automatically converts the links inside the tags so that they keep working. For example, a link pointing to index.html is modified to ../index.html, preserving the link structure. The same is done for images.
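For the curious, the core of such a replacement is a plain string operation. A simplified sketch of the idea (not the actual site_update code, and leaving out the sub-folder link rewriting) could look like this:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class HeaderFooterSketch {

    // replaces whatever sits between two tags with new content, leaving the tags in place
    static String replaceBetween(String html, String openTag, String closeTag, String content) {
        int start = html.indexOf(openTag);
        int end = html.indexOf(closeTag, start + openTag.length());
        if (start < 0 || end < 0) {
            return html;                       // tags not present, leave the page untouched
        }
        return html.substring(0, start + openTag.length()) + content + html.substring(end);
    }

    public static void main(String[] args) throws IOException {
        String header = new String(Files.readAllBytes(Paths.get("html-header.txt")), StandardCharsets.UTF_8);
        String footer = new String(Files.readAllBytes(Paths.get("html-footer.txt")), StandardCharsets.UTF_8);

        try (Stream<Path> files = Files.list(Paths.get("."))) {
            List<Path> pages = files.filter(p -> p.toString().endsWith(".html"))
                                    .collect(Collectors.toList());
            for (Path page : pages) {
                String html = new String(Files.readAllBytes(page), StandardCharsets.UTF_8);
                html = replaceBetween(html, "<header>", "</header>", header);
                html = replaceBetween(html, "<footer>", "</footer>", footer);
                Files.write(page, html.getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}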

An example of this program in use can be found on the TripleCheck website, whose code is available on GitHub at https://github.com/triplecheck/triplecheck.github.io


Feedback, new features?

I'd be happy to help. Just let me know in the comment box here or write a post on GitHub.





List of 310 software licenses in JSON format

I recently needed a list of licenses to use inside a web page. The goal was to present the end-user with a set of software licenses to choose from. However, I couldn't find one readily available as JSON or in some other format that could be embedded as part of JavaScript code.

So I created such a list, based on the nice SPDX documentation. It contains 310 license variations and types. I explicitly mention "types" because you will find licenses called "Proprietary", defining some sort of customized terms, and a "Public domain" type, which is not a license per se but in practice denotes the lack of an applicable license, since copyright (in theory) is not considered applicable in that case.

If you are OK with these nuances, you can download the JSON list from https://github.com/triplecheck/engine/blob/master/run/licenseList.js

The list was not crafted manually; I wrote a few lines of Java code to output the file. You can find that code at https://github.com/triplecheck/engine/blob/master/src/provenance/javascript/OutputLicenseList.java

If you find the list useful and have feedback or need an updated version, just let me know.





SSDEEP in Java

If you are familiar with similarity hashing algorithms (a.k.a. fuzzy hash matching) and need an SSDEEP implementation in Java code, it is available directly from my Github account at this location: https://github.com/nunobrito/utils/tree/master/Utils/src/utils/hashing/ssdeep

The original page for SSDEEP can be found at http://ssdeep.sourceforge.net/

On that page you will also find the binaries for Windows.

Have fun.

Preserving the soul of an old laptop

If you're like me and keep old laptops around the house as wannabe time capsules: I've recently started converting their physical operating systems into virtual machines that I can run from a PC emulator.

The concept is called P2V (Physical To Virtual) and has been made simpler over recent years. My favorite tool for this purpose is provided by VMWare at http://www.vmware.com/products/converter

It is a freeware tool, although you have to provide an email address to access the download page. What I like about it is that the most difficult steps are automated. All one needs to do is install it, convert, and run the new virtual machine through a wizard-driven menu with a few clicks.

Being a VMware tool, you'd think it would restrict running the virtual image to their own line of products. However, I was able to use VirtualBox to see my old Windows 7 booting and running from a virtual machine.

Very nice to be able to preserve the old look & feel, the apps, documents and working environment so quickly as hardware moves forward.

Windows: Driver for logging the timing of drivers and services at startup

Sometimes it is good to measure how long a Windows laptop takes to boot and which drivers or services might be dragging down the boot process. There are ways of measuring this using Microsoft-provided tooling, but they aren't redistributable.

To overcome this limitation, I wrote a simple driver that writes a text file with a timestamp as each other driver or service gets called. This way we can (more or less) expose which drivers or services take longer to load.

This is a sample of what to expect:
18/02/2015 13:16:40.437, Driver, 4, \SystemRoot\System32\Drivers\crashdmp.sys
18/02/2015 13:16:40.453, Driver, 4, \SystemRoot\System32\Drivers\iaStor.sys
18/02/2015 13:16:40.453, Driver, 4, \SystemRoot\System32\Drivers\dumpfve.sys
18/02/2015 13:16:40.812, Driver, 4, \SystemRoot\system32\DRIVERS\cdrom.sys
18/02/2015 13:16:40.812, Driver, 4, \SystemRoot\System32\Drivers\Null.SYS
18/02/2015 13:16:40.828, Driver, 4, \SystemRoot\System32\Drivers\Beep.SYS
18/02/2015 13:16:40.843, Driver, 4, \SystemRoot\System32\drivers\watchdog.sys
18/02/2015 13:16:40.843, Driver, 4, \SystemRoot\System32\drivers\VIDEOPRT.SYS
18/02/2015 13:16:40.843, Driver, 4, \SystemRoot\System32\drivers\vga.sys
18/02/2015 13:16:40.843, Driver, 4, \SystemRoot\System32\DRIVERS\RDPCDD.sys
18/02/2015 13:16:40.859, Driver, 4, \SystemRoot\system32\drivers\rdpencdd.sys
18/02/2015 13:16:40.859, Driver, 4, \SystemRoot\system32\drivers\rdprefmp.sys
18/02/2015 13:16:40.859, Driver, 4, \SystemRoot\System32\Drivers\Msfs.SYS
18/02/2015 13:16:40.875, Driver, 4, \SystemRoot\System32\Drivers\Npfs.SYS
18/02/2015 13:16:40.875, Driver, 4, \SystemRoot\system32\DRIVERS\TDI.SYS

The code is available under the EUPL terms and hosted on GitHub at this location: https://github.com/nunobrito/BootLogger

On the download folder you find the compiled drivers (x86 and x64 versions) along with the instructions on how to use the driver on your machine.

Feedback from other users can be read at reboot.pro in this topic:
http://reboot.pro/topic/20345-driver-for-logging-windows-boot-drivers-and-services/

Each boot log report is placed under c:\BootLogger; this parameter is configurable in case you want to change it.

Have fun!
:-)






Looking ahead

Looking ahead
there is a different course.
A course that dictates the future,
of short time and breath
to escape the torment
that the brief moment brings.
Thus we have a year
hardly sane and profane
which, tended as it now looks,
can only bring more harm.
Ten months remain to finish
this small work of enchantment,
which was such a joy to begin
and left so little time to savour.
I imagine what the day would be like
when the weight disappeared.
A day running with joy,
I would relish it, it would be magic.
Such a day will come
one day, hopefully.

Java hidden gem: CopyOnWriteArrayList()

CopyOnWriteArrayList() is a cousin of the well-known ArrayList() class.

ArrayList is often used for storing items. In my case, I had been working on a multi-threaded program that shared a common ArrayList.

To improve performance, every now and then I would remove some of the items from this list when they matched certain criteria. In the past I would use an Iterator to walk through the items using the iterator.next() method.

To remove an item I'd just call iterator.remove(). However, this approach kept failing for an odd reason:
java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification (AbstractList.java:372)
I tried placing synchronized on the relevant methods, but processing just got slower and the error was not solved.

So, what else can one try? Looking around the web I found the not-so-well-known CopyOnWriteArrayList() and, to my surprise, it solved the problem with a nice performance boost.

It works much like a typical ArrayList, but every iteration runs over a snapshot of the list, so removals elsewhere never trigger the exception above. To remove items I use a second, decoupled list and place the items to be removed there. An independent status thread then runs in three-second interval loops, checking whether this second list has any items and removing them from the main list in an asynchronous manner.
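A stripped-down sketch of that pattern (class and field names invented for illustration):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class DeferredRemovalSketch {

    // shared between worker threads; iteration never throws ConcurrentModificationException
    private final List<String> items = new CopyOnWriteArrayList<>();
    // items queued for removal, decoupled from the main list
    private final List<String> toRemove = new CopyOnWriteArrayList<>();

    // worker threads call this while iterating the main list
    public void process() {
        for (String item : items) {
            if (item.isEmpty()) {              // placeholder removal criteria
                toRemove.add(item);            // never touch the main list here
            }
        }
    }

    // a status thread calls this every few seconds
    public void cleanup() {
        if (!toRemove.isEmpty()) {
            items.removeAll(toRemove);         // one batched, asynchronous removal
            toRemove.clear();
        }
    }

    public void startStatusThread() {
        Thread status = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                cleanup();
                try {
                    Thread.sleep(3000);        // the three-second interval loop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        status.setDaemon(true);
        status.start();
    }
}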

All in all, running the code in multi-threaded mode and adopting CopyOnWriteArrayList() reduced the overall processing time for 17 million lines of data from 30 minutes to around 10 minutes, an average of 30k lines/second. The text database used as an example is 12.3 GB in size and contains 2.5 billion snippets that are compared against 164 methods from my test sample.

This translates to roughly 41 billion comparisons taking place in 10 minutes.

For reference, when my computer just reads the lines without any processing it reaches an average speed of 140k lines/second; this value reveals the upper I/O limit imposed by disk bandwidth. The speed of 30k lines/second is (probably) due to CPU limitations (an i7 core) when doing similarity comparisons between strings.


The performance is not bad, but at this point I'm running out of ideas on how to bring the processing time down further. The bottleneck is still the comparison algorithm; I've already written a cheaper/dirty version of the Levenshtein algorithm for faster comparisons, but it is still not enough.


Any ideas?


EDIT

After some more time looking at performance, I noticed that the comparison of two strings was being done using String objects, with redundant transformations back and forth between char[] and String objects. The code was modified to work only with char[] arrays. Speed doubled and is now averaging 60k lines/second, taking 5 minutes to complete the same processing because less stress is placed on the CPU.
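For illustration, the gist of the change is to convert each string to a char[] once, up front, and keep every comparison on the raw arrays; a tiny hedged sketch (not the actual similarity algorithm):

public class CharCompareSketch {

    // counts matching positions between two snippets without allocating new Strings
    static int matchingChars(char[] left, char[] right) {
        int length = Math.min(left.length, right.length);
        int matches = 0;
        for (int i = 0; i < length; i++) {
            if (left[i] == right[i]) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        char[] a = "int x = 1;".toCharArray();    // convert once, outside the hot loop
        char[] b = "int y = 1;".toCharArray();
        System.out.println(matchingChars(a, b));
    }
}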




Java: RandomAccessFile + BufferedReader = FileRandomReadLines

In the Java world when reading large text files you are usually left with two options:
  1. RandomAccessFile
  2. BufferedReader

Option 1) allows reading text from any given part of the file but is not buffered, meaning that it will be slow at reading lines.

Option 2) is buffered and therefore fast, but you need to read each line from the beginning of the text file until you reach the place where you actually want to read data.

There are strategies to cope with these mutually exclusive options: one is to read data sequentially, another is to partition the data into different files. However, sometimes you have a case where you need to resume some time-consuming operation (think on a scale of days) involving billions of normally sized lines. Neither option 1) nor option 2) will suffice.

Up to this point I had been trying to improve performance, removing IFs and any code I could to squeeze out a few more ounces of speed, but the problem remained the same: we need an option 3) that mixes the best of both. There wasn't one readily available that I could find around the Internet.

In the meantime I found a hint that it might be possible to feed a BufferedReader directly from a RandomAccessFile. I tested this idea and it was indeed possible, albeit still with some rough edges.

For example, if we are already reading data from the BufferedReader and decide to change the file position on the RandomAccessFile object, the BufferedReader will have stale data in its buffer. The solution I applied is to simply create a new BufferedReader, forcing the buffer to be reset.
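The core of the idea can be sketched in a few lines (a simplified illustration, not the published class): wrap the RandomAccessFile's file descriptor in a stream, and rebuild the BufferedReader after every seek so its buffer never serves stale bytes.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RandomBufferedReaderSketch {

    private final RandomAccessFile file;
    private BufferedReader reader;

    public RandomBufferedReaderSketch(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        seek(0);
    }

    // jump to an absolute byte offset and reset the buffered view
    public void seek(long offset) throws IOException {
        file.seek(offset);
        // re-create the BufferedReader so the old buffer cannot serve stale data
        reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(file.getFD()), StandardCharsets.UTF_8));
    }

    // buffered line reads from the current position onwards
    public String readLine() throws IOException {
        return reader.readLine();
    }

    public void close() throws IOException {
        file.close();
    }
}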


Now I'm making available the code that combines these two approaches. You can find the FileRandomReadLines class at https://github.com/nunobrito/utils/blob/master/Utils/src/utils/ReadWrite/FileRandomReadLines.java

It has no third-party dependencies, so you will likely be fine just downloading it and including it in your code. Maybe a similar implementation has been published elsewhere before; I didn't find one, and I tried as much as possible to find ready-made code.

If you see possible improvements, do let me know and I'll include your name in the credits.

A trillion files

2014 has come to an end, so I'm writing a retrospective about what happened and what might be coming down the road in 2015.

For me, the year's first milestone came in February with a talk about SPDX and open source at FOSDEM. At that time I was applying for a position as co-chair of the SPDX working group, but another candidate in Europe was chosen, apparently more suited.

Nevertheless, throughout the year I kept up my work related to the SPDX open format. At FOSDEM the first graphical visualizer for SPDX documents was debuted; in the process, a license detection engine was written to find common software licenses and place this information in newly generated SPDX documents.

On the TripleCheck side, funding was a growing difficulty across the year. After FOSDEM there was urgency in raising funds to keep the company running. At that point we had no MVP (minimum viable product) to show, and investors had no interest in joining the project. Despite our good intentions and attempts to explain the business concept, we didn't have the presentation and business skills needed to move forward. The alternative for funding without depending on investors was EUREKA funding from the Eurostars programme.

For this purpose a partnership was formed with an aerospace organization and another company well matured in the open source field. We aimed to take a step forward in open source licensing analysis. After months of preparation, iteration and project submission we got a reply: not accepted. The critique that pained me the most was reading that because our project would be open source, we would be unable to maintain a sustainable business, since competitors would copy our work. Maybe they have a point, but being open source ourselves is our leverage against competitors, since this is a path they will not cross, and it opened the doors of the enterprise industry to what we do. Open source companies find it hard to succeed, but despite the hard path I wasn't willing to see us become like the others.

In parallel, people had been hired in the previous months to work on the business side of TripleCheck, but it just wasn't working as we had hoped. The focus then moved strictly to code development and reaching an MVP, but this wasn't working from a financial perspective either. At this point my own bank savings were depleted, the company was reduced back to the two original founding members, and it seemed like the end of the story for yet another startup that had tried its luck. We had neither the finances, nor the team, nor the infrastructure to process open source software at large scale.


Failure was here; it was time to quit and go home. So, as an engineer, I simply accepted failure as a consolidated fact. With everything failed, there was nothing to lose. The question was: "What now?"

There was enough money in the bank to pay rent and stay at home for a couple of months. Finding a new job is relatively easy when you know your way around computers. It was a moment very much like a certain song, where the only thing occupying my mind was not really the failure, but the fact that I remained passionate about solving the licensing problem and wanted to get this project done.

So, let's clear the mind and start fresh. No outside VC financing, no ambitious business plans, no business experts, no infrastructure, and no resources other than what was available right then. Let's make things work.


I kept working on the tooling, kept moving forward, and eventually got approached by companies that needed consulting. TripleCheck was no longer a startup looking for explosive growth; it now had the modest ambition of making enough to pay the bills and keep working with open source.

Consulting in the field of open source compliance is not easy when you're a small company. While bigger consulting companies in this field can afford to just hand back a report listing what is wrong with a client's code, we had to do the same, plus get our hands dirty changing the code to make it compliant. Looking back, this was one heck of a way to gain expertise in a complete and fast-forward manner.

Each client became a beta-tester for the tooling being developed at the same time. This meant that the manual process was incrementally replaced with an automated one. Our tooling improved with each analysis, which brought different code to analyze, different requirements and different licenses to interpret. At some point the tooling got so accurate that it could detect licensing faults in the open source code of companies such as Microsoft.

Around this time our first investor surfaced. A client was selling his company and was amazed by the work done while inspecting his code. For me this was one of those turning points: now we had a business expert on our side. Our old PowerPoint pitch decks were crap; nobody really understood why anyone would need a compliance check. But this investor had lived through the pain of not having his code ready for acquisition and knew how relevant this code repair had been. This was an opportunity to bring aboard a person with first-hand experience as a client, someone to whom we didn't have to explain why it mattered to fix licensing with a tool rather than an expert human.



With his support more business got done and our presentation improved. It was now possible to move forward. One of the goals in mind was the creation of an independent open source archive. In August we reached the mark of 100 million source code files archived. A new type of technology dubbed "BigZip" was developed for this purpose, since normal file systems and archive formats were ill suited for this scale of processing. A good friend of mine nicely described the concept as a "reversed zipped tar": we create millions of zip files inside a single file, the reverse of what tar.gz does on Linux.
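To make the "reversed zipped tar" idea more concrete, here is a minimal sketch of the concept under my own assumptions: each file is compressed on its own and appended to one big container file, while a small plain-text index records where each compressed blob starts and how long it is. This is only an illustration, not the actual BigZip implementation, and the class and method names are hypothetical.

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.util.zip.DeflaterOutputStream;

    /** Hypothetical sketch of a "reversed zipped tar": many individually
     *  compressed files appended to a single container file, plus an index. */
    public class ReversedZippedTar {

        public static void append(File source, File container, File index)
                throws IOException {
            // compress this one file on its own (one "zip" per file)
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            DeflaterOutputStream zipper = new DeflaterOutputStream(compressed);
            zipper.write(Files.readAllBytes(source.toPath()));
            zipper.close();
            byte[] blob = compressed.toByteArray();

            // append the compressed blob to the end of the single container file
            long offset = container.length();
            FileOutputStream out = new FileOutputStream(container, true);
            out.write(blob);
            out.close();

            // record offset, length and name so the blob can be read back later
            FileWriter writer = new FileWriter(index, true);
            writer.write(offset + " " + blob.length + " " + source.getName() + "\n");
            writer.close();
        }
    }

With such an index, reading one file back is a matter of seeking to its offset inside the container and inflating that single blob, instead of opening millions of small files on disk.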

This solved the problem of processing files in large numbers. To get files from the Internet, a project called "gitFinder" was developed, which retrieved over 7 million open source projects. Our first significant data set had been achieved.

August was also the time for TripleCheck's first conference stand, at FrOSCon. At this event we had already developed a new technology able to find snippets of code that were not original. It was dubbed "F2F", based on the humour-inspired motto "Hashes to Hashes, FOSS to FOSS", a nod to the fact that file hashes (MD5, SHA1, ...) were used to expose FOSS source code files inside proprietary code.
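As a rough, hypothetical illustration of that kind of hash matching (not the F2F code itself), one can compute the SHA-1 of every file in a code base and check it against a set of hashes of known FOSS files. The hash values and class name below are made up for the example.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    /** Hypothetical sketch: flag files whose SHA-1 matches a known FOSS file. */
    public class HashMatcher {

        // hashes of known open source files (made-up values for this example)
        private static final Set<String> KNOWN_FOSS_HASHES = new HashSet<String>(
                Arrays.asList("356a192b7913b04c54574d18c28d46e6395428ab",
                              "da4b9237bacccdf19c0760cab7aec4a8359010b0"));

        /** Walks a folder and reports any file whose hash matches a known FOSS file. */
        public static void scan(File folder) throws Exception {
            File[] files = folder.listFiles();
            if (files == null) {
                return;
            }
            for (File file : files) {
                if (file.isDirectory()) {
                    scan(file);
                } else if (KNOWN_FOSS_HASHES.contains(sha1(file))) {
                    System.out.println("FOSS file found: " + file.getPath());
                }
            }
        }

        /** Computes the SHA-1 of a file as a lowercase hexadecimal string. */
        private static String sha1(File file) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            InputStream in = new FileInputStream(file);
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            in.close();
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }

Exact hash matching only catches files copied verbatim, which is why snippet-level detection, described next, was needed as well.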


This code created a report indicating which snippets of code were not original and where else on the Internet they could be found. For this purpose I wrote a token translator/comparator and a few other algorithms to detect code similarity. The best memory I have from this development is writing part of the code on a boat directly in front of the Eiffel Tower. When you're a developer, these are the memories one recalls with a smile as the years pass.
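For readers curious about what a token translator/comparator can look like in its simplest form, here is a hedged sketch under my own assumptions (not the actual F2F implementation): identifiers and numbers are replaced by generic tokens, so two snippets that differ only in naming still produce the same token sequence and can be compared position by position.

    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical sketch of token-based similarity: identifiers and numbers
     *  are normalized so that renamed copies of a snippet still look identical. */
    public class TokenComparator {

        /** Translates source code into a normalized token sequence. */
        public static List<String> tokenize(String code) {
            List<String> tokens = new ArrayList<String>();
            for (String raw : code.split("\\W+")) {
                if (raw.isEmpty()) {
                    continue;
                } else if (raw.matches("\\d+")) {
                    tokens.add("NUM");   // any number literal becomes NUM
                } else if (isKeyword(raw)) {
                    tokens.add(raw);     // language keywords are kept as-is
                } else {
                    tokens.add("ID");    // identifiers become ID
                }
            }
            return tokens;
        }

        /** Fraction of positions where both token sequences agree (0.0 to 1.0). */
        public static double similarity(List<String> a, List<String> b) {
            int shared = Math.min(a.size(), b.size());
            if (shared == 0) {
                return 0.0;
            }
            int matches = 0;
            for (int i = 0; i < shared; i++) {
                if (a.get(i).equals(b.get(i))) {
                    matches++;
                }
            }
            return (double) matches / Math.max(a.size(), b.size());
        }

        private static boolean isKeyword(String word) {
            return word.matches("if|else|for|while|return|int|void|class|public|static");
        }
    }

With this normalization, "int total = count + 1;" and "int sum = items + 1;" tokenize to the same sequence and score 1.0, which is exactly the kind of renamed copy a snippet detector needs to catch.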

Shortly afterwards, in October, TripleCheck got attention at LinuxCon Europe. For this event we brought aboard a partnership with http://searchcode.com to create or view an SPDX document online from a given repository on GitHub. At the same event we made available a DIY project that enabled anyone to generate one million SPDX documents. To provide context, the SPDX format is criticized for the lack of example documents available to the public; the goal was making available as many documents as possible. Sadly, no public endorsement from the SPDX working group came for this kind of activity. To make matters worse, too often my emails were silently ignored on the mailing list whenever I proposed improvements. That was sad; I had real hopes of seeing this open standard rise.


I can only wonder if the Linux Foundation will ever react. I'm disenchanted with the SPDX direction, but I believe we (the community) very much need this open standard for code licensing to exist, so I keep working to make it reachable and free of cost.


From November to December the focus was scaling our infrastructure. This meant a code rewrite to apply the lessons learned. The code was simplified to a level where we can keep using inexpensive hardware and software, and where only one or two developers are needed to improve it.

The result was a platform that, by the end of December, reached the milestone of one trillion files archived. In this sense we achieved what others said would be impossible without proper funding. These files belong to several million projects around the web and are now ready for use in future code analysis. For example, upcoming in 2015 is the introduction of two similarity matching algorithms converted to Java: TLSH from Trend Micro and SDHash. This is code that we are donating back to the original authors after conversion, and we will be testing it to see how it performs on code comparisons.


In retrospect I am happy. We passed through great moments and collapsing ones, and lived through a journey that produced code that others can reuse. I'm happy that TripleCheck published more code in a single year than any other licensing compliance provider has over the whole of their existence, which in most cases spans more than a decade.

At the end of the day, after TripleCheck is long gone, it is this same code that will remain public and reachable for other folks to reuse. Isn't knowledge sharing one of the cornerstones of human evolution? In 2014 we helped human knowledge about source code move forward; let's now start 2015.



The old new things

You probably wouldn't know this, but I keep an old computer book on my table.

The title is "Análise e programação de computadores" (Analysis and programming of computers), published in 1970 in Portuguese by Raúl Verde. It should be noted that the first computer book written in my native language had been published only two years earlier, by the same author.

I keep this book on my table because it teaches the basic principles that an aspiring computer programmer should aim to understand. It focuses on programming in high-level languages such as COBOL and FORTRAN. It discusses the tradeoffs between storing data on cards versus paper or magnetic tape, and provides formulas to expose the strong points of each and support informed decisions.

This book might seem awkward if published today, mostly because it mixes software development with engineering, architecture and business concerns that have since grown into specialized fields. Yet these principles remain the same and should still row in the same direction. Although technology has grown and matured, the truth is that we keep working to push new boundaries, squeezing performance out of what is now within reach and somehow making things work.

In this sense, I find it curious how much of the software I'm developing today bears a strong resemblance to the old techniques in the book: the use of flat files, the console output that helps to monitor results, the considerations about storage and its impact on speed, capacity and cost.

Paraphrasing a blog title, I'm reminded that humans do old new things.

This month I had the task of downloading over a billion files from the Internet in a single run. Compare the writing in this book with today's tasks and they are not so different. Even the term "scale" is no excuse once we place these challenges in context.

And yet, as I watch a thousand lines per minute scroll through that silent terminal, it gives me a moment of happiness. Data is getting processed, things are working, the system is fast enough.

Each line on the screen represents a repository of code and files belonging to a given person. What kind of work or history was that person building? Then I think of the millions of other repositories still left to process; it is something pretty to just watch as it scrolls.

In some weird way, for a computer geek, looking at these lines on a terminal feels like looking at the stars at night: imagining how each bright dot has its own sun, planets and history.

Would these two things be comparable? Perhaps yes, perhaps not.

The old new thing is something to think about.




Java: Reading the last line on a large text file

Following the recent batch of posts related to text files: sometimes it is necessary to retrieve the last line of a given text file.

My traditional way of doing this operation is to use a buffered reader and iterate over all lines until the last one is reached. This works relatively fast, at around 400k lines per second on a typical i7 CPU (4 cores) at 2.4 GHz in 2014.
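For reference, a minimal sketch of that straightforward approach looks like the method below (the method name is mine, not from the original tooling):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    /**
     * Returns the last line by reading the whole file from the start.
     * Simple and correct, but it has to visit every line before the last one.
     */
    public static String getLastLineSlow(final File file) throws IOException {
        String lastLine = "";
        BufferedReader reader = new BufferedReader(new FileReader(file));
        String line;
        while ((line = reader.readLine()) != null) {
            lastLine = line;
        }
        reader.close();
        return lastLine;
    }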

However, for text files with hundreds of millions of lines this approach becomes too slow.

One solution is to use RandomAccessFile to access any point of the file without delay. Since we don't know the exact position of the last line, a possible approach is to iterate backwards over the last characters until a line break is found.

Seeking one position at a time and reading a single character is not the most efficient approach, so the snippet below reads blocks of 4096 bytes at a time instead.

The code snippet below solves my issue and gets the last line of a large text file in under a millisecond, regardless of the number of lines in the file.

    /**
     * Returns the last line from a given text file. This method is particularly
     * well suited for very large text files that contain millions of text lines
     * since it will just seek the end of the text file and seek the last line
     * indicator. Please use only for large sized text files.
     * 
     * @param file A file on disk
     * @return The last line or an empty string if nothing was found
     * 
     * @author Nuno Brito
     * @author Michael Schierl
     * @license MIT
     * @date 2014-11-01
     */
    public static String getLastLineFast(final File file) {
        // file needs to exist
        if (file.exists() == false || file.isDirectory()) {
                return "";
        }

        // avoid empty files
        if (file.length() <= 2) {
                return "";
        }

        // open the file in read-only mode; try-with-resources closes it when done
        try (RandomAccessFile fileAccess = new RandomAccessFile(file, "r")) {
            char breakLine = '\n';
            // offset of the current filesystem block - start with the last one
            long blockStart = (file.length() - 1) / 4096 * 4096;
            // hold the current block
            byte[] currentBlock = new byte[(int) (file.length() - blockStart)];
            // later (previously read) blocks
            List<byte[]> laterBlocks = new ArrayList<byte[]>();
            while (blockStart >= 0) {
                fileAccess.seek(blockStart);
                fileAccess.readFully(currentBlock);
                // ignore the last 2 bytes of the block if it is the first one
                int lengthToScan = currentBlock.length - (laterBlocks.isEmpty() ? 2 : 0);
                for (int i = lengthToScan - 1; i >= 0; i--) {
                    if (currentBlock[i] == breakLine) {
                        // we found our end of line!
                        StringBuilder result = new StringBuilder();
                        // RandomAccessFile#readLine uses ISO-8859-1, therefore
                        // we do here too
                        result.append(new String(currentBlock, i + 1, currentBlock.length - (i + 1), "ISO-8859-1"));
                        for (byte[] laterBlock : laterBlocks) {
                                result.append(new String(laterBlock, "ISO-8859-1"));
                        }
                        // maybe we had a newline at end of file? Strip it.
                        if (result.charAt(result.length() - 1) == breakLine) {
                                // newline can be \r\n or \n, so check which one to strip
                                int newlineLength = result.charAt(result.length() - 2) == '\r' ? 2 : 1;
                                result.setLength(result.length() - newlineLength);
                        }
                        return result.toString();
                    }
                }
                // no end of line found - we need to read more
                laterBlocks.add(0, currentBlock);
                blockStart -= 4096;
                currentBlock = new byte[4096];
            }
        } catch (Exception ex) {
                ex.printStackTrace();
        }
        // oops, no line break found or some exception happened
        return "";
    }

If you're worried about re-using this method: it might help to know that I co-authored this code snippet and that you are welcome to reuse it under the MIT license terms. You are also welcome to improve the code; there is certainly room for optimization.

Hope this helps.

:-)