Too many permissions for cellphone apps

It seems that nowadays any simple Android app asks you to grant access to:
  • Your location
  • Your contacts
  • Your WIFI network name
  • Your phone ID
  • Reading what other apps are installed
  • Disk access

Unfortunately, stock Android has been slow to solve this issue, which was solved years ago in other Android distributions such as CyanogenMod.

In any case, it is nowadays possible to remove some of the permissions. Install the app as required, then before running it for the first time go to Settings -> Apps -> YourApp.

From there, you can deny the permissions to access private data. Guess what? Most apps keep working exactly as normal, which makes one wonder why they request them in the first place.

Stay safe.

IPFS: the missing link for our future

There is a new way to store files online. It is called IPFS.

This is a decentralized network that functions on top of the Internet. The idea is that you can publish a file of any kind online and anyone else can decide to host the same file so you are not alone.

We have torrents. We have similar projects from the past, and likely more will become available in the future. Why would this one be special?

This one is easy to use, has a catchy name and simply works.

For those interested in cyber-archaeology, the worst fear is that any given server will fail at some point. GeoCities is gone, thousands of forum sites disappear every year, let alone the small zip files and other resources that we will not be finding again any time soon.

IPFS proposes a good way to preserve that link with our future, before it goes missing. Uploading new files is simple, straightforward and anonymous. The potential is there.

Imagine our forum software being rewritten one day to simply store the content in text files and permit any viewer to iterate over and use this data. Even when the original server goes offline, the forum itself would continue to function, read-only at the very least.

The same goes for image and attachment hosting. Today, the server of a forum hosts the attachments and images posted by its end-users. With IPFS, any user has the option to store those same files and thus preserve them when they are no longer available at the original location.

This matters not only for the future. It matters too for geographies where Internet access to certain sites is blocked today. Or even better, just imagine how difficult it is today to read a forum site without being monitored through your operating system, the network cables, the web browser and the JavaScript libraries that simply tell the whole world what you are doing online at any given moment.

At the end of the day, decentralization is the basis of our Internet.
It is our place where anyone, anywhere, anyhow can share knowledge at any time.

Let's keep it that way.

Try out IPFS yourself and see today what the future looks like.

10 things to learn from the 1 400 000 000 passwords/emails leaked to public

Just writing 1.4 billion doesn't work.

To visually understand how big this recent leak of data was, you really need to slowly count the zeros in the title of this post.

That's data that anyone with some time will be able to find. It is not awfully recent. It dates from about 2016, and most of the major websites such as Google, LinkedIn, Dropbox and similar have already forced their customers to change the passwords they were using.

Still (and this is a big still), the amount of information that you can extract from this database with 1 400 000 000 user accounts is simply gigantic.

10 things anyone can learn about you:
  1. Knowing your old password means that anyone can query that same password and find other email accounts that you are using (for example, Gmail accounts)
  2. An attacker can likely spot a pattern to try on other sites. For example, "linkedin1970" as a password hints that they can try other sites by replacing the "linkedin" portion
  3. For big organizations, hundreds if not thousands of email addresses from real employees can now be targeted for phishing
  4. Passwords are intimate and often reveal what is on the mind of the user. Some passwords are too revealing (e.g. sexual orientation, religion, romantic partners) and this information can be used against the user (blackmail, defamation)
  5. Revealing identities: some people belong to a company or organization and do not want this information to be public
  6. Email patterns: learning the convention under which addresses are created inside a company helps attackers guess the email address of another person there that they want to target
  7. Discovering your nationality or real name, based on the country portion of the domains that your accounts use
  8. Discovering previous companies where a person has worked
  9. Getting direct email access to the CEO/CTO of smaller companies
  10. Passwords hint at your security knowledge. Within the same organisation, a person using special characters will look more knowledgeable than another using only simple words. This helps attackers pick the users most likely to fall for social engineering traps

The potential for misuse and abuse is there.

I spent a good part of last week looking at the data, cleaning up the records and verifying their authenticity. This data is real, even my mom had her password listed there.

Some cases were just weird. While looking up the name of a known criminal as a test, the first match indicated that he had an email account with a very small email provider in Switzerland. In other cases, such as accounts from domains belonging to football clubs, the large majority of passwords included the name of the football club (e.g. "benfica1"). One of these clubs had recently gone through problems when its emails got leaked to the public. After looking at their password practices, I can really understand why it wasn't that difficult to guess them.

What seems more troubling is the number of people using their company emails to register on external sites. Certainly in many cases it is necessary, but I can't stress enough that this should be avoided as much as possible.

Change your passwords and use two-step authentication when available. Over the next two weeks we will see so many people losing their privacy, so please change your own passwords without delay.

Want to help your friends? Make sure they read this page so they can also learn. That's good karma being built in 2018 right from the start.

Stay safe out there.

Flat files are faster than databases (for my purpose)

I've tried. I've honestly tried (again) to use databases for storing tens of millions of entries, but it is just too slow.

Over the past week I've been trying different approaches for storing file hashes inside a database, so that it would become possible to run neat queries over the data without re-inventing the wheel like we have done in the past.

Our requirements are tough because we are talking about tens of millions, if not billions, of entries, and about answering database queries on plain normal laptops without any kind of installation. The best candidate for this job seemed to be the H2 database, after previously trying SQLite and HSQLDB.

The first tries were OK with a small sample but became sluggish when adding data at larger scale. There was a peak of performance at the beginning and then it would creep to awful slowness as the database got bigger. Further investigation helped with the performance: disabling indexing, the cache and other details that get in the way of a mass data insert. Still, I would wait 1 to 2 days, and the slow speed gave no confidence that the data sample (some 4Tb with 19 million computer files) could be iterated in useful time.

So I went back to old school methods and used plain flat files in CSV format. It was simpler to add new files. It was easy to check with a text editor whether things were being written as intended, and simpler to count how much data had already been inserted during the data extraction. And it was fast. Not simply fast, it was surprisingly fast and completed the whole operation in under 6 hours.
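
As an illustration of the flat-file approach, here is a minimal sketch that appends one CSV line per file. The class name `FlatFileIndex`, the column layout (hash, size, path) and the choice of SHA-256 are assumptions made for this example, not the format actually used for the index described above.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: append one "hash,size,path" line per file to a CSV index. */
final class FlatFileIndex {

    /** Streams the file through SHA-256 (an assumed hash choice) without loading it fully in memory. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[64 * 1024];
            while (in.read(buffer) != -1) {
                // reading is enough: DigestInputStream updates the digest as a side effect
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Quotes the path so that commas or quotes inside file names do not break the CSV. */
    static String csvLine(String hash, long size, String path) {
        return hash + "," + size + ",\"" + path.replace("\"", "\"\"") + "\"";
    }

    /** Appends a single record; the index file is created on first use. */
    static void append(Path index, Path file) throws IOException, NoSuchAlgorithmException {
        String line = csvLine(sha256Hex(file), Files.size(file), file.toString());
        try (BufferedWriter out = Files.newBufferedWriter(index, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            out.write(line);
            out.newLine();
        }
    }
}
```

Appending never touches the data already written, so the cost per record stays flat as the file grows, and counting the lines in the file shows how far the extraction has progressed. A real bulk run would keep one writer open instead of reopening the file per record.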

It is frustrating to try using a fancy database and then end up using the traditional methods, simply because they are reliable and faster than other options. I've tried, but for now will continue using flat files to index large volumes of data.

Getting started with the H2 database

The H2 database is a small and compact way to store data directly from Java, especially because it can use simple binary files as storage. My goal with these things is performance and large scale indexing of data. When speaking about large scale, I'm talking about hundreds of millions of rows.

In this blog post I'm adding some of the things that are useful for those interested in debugging this database.

Starting an interactive shell

java -cp h2.jar org.h2.tools.Shell -url jdbc:h2:./ -user xyz -password 123456

Where you should adjust:
  • h2.jar is the library jar file from H2 (just a single file)
  • ./ is the filename for your database
  • "xyz" and "123456" are the user and password


Hello World (writing your first data and database to disk)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Start {
    /**
     * @param args not used
     */
    public static void main(String[] args) {
        // try-with-resources closes the connection and the statement when done
        try (Connection con = DriverManager.getConnection("jdbc:h2:~/test", "test", "" );
             Statement stmt = con.createStatement()) {
            //stmt.executeUpdate( "DROP TABLE table1" );
            // note: newer H2 versions may treat "user" as a reserved word;
            // rename the column if the CREATE TABLE statement is rejected
            stmt.executeUpdate( "CREATE TABLE table1 ( user varchar(50) )" );
            stmt.executeUpdate( "INSERT INTO table1 ( user ) VALUES ( 'Claudio' )" );
            stmt.executeUpdate( "INSERT INTO table1 ( user ) VALUES ( 'Bernasconi' )" );
            ResultSet rs = stmt.executeQuery("SELECT * FROM table1");
            // rs.next() moves to the next row and returns false when none are left
            while( rs.next() ) {
                String name = rs.getString("user");
                System.out.println( name );
            }
        } catch( Exception e ) {
            System.out.println( e.getMessage() );
        }
    }
}

This example was retrieved from Claudio Bernasconi (thank you!):

Relevant commands

To list your tables inside the database:
SHOW TABLES;

To list everything inside a table:
SELECT * FROM MyTable;

Where "MyTable" is naturally the name for your table.

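Showing which INDEXES are associated with a table:
SELECT * FROM information_schema.indexes WHERE table_schema = 'PUBLIC' AND table_name='MyTable';

By default H2 stores unquoted identifiers in upper case, so a table created as MyTable usually needs to be matched here as 'MYTABLE'.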

Relevant links related to performance

Several switches to improve performance:

Some people say that an active INDEX will slow down large scale data entry:!msg/h2-database/7U99fWMaMw0/FbpkXgWIrDcJ

Learn to add multiple data fields per SQL call to speed things up further:
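
A minimal sketch of that idea: build a single INSERT statement that carries several rows, then bind the values through a standard JDBC PreparedStatement. The class and method names (`MultiRowInsert`, `sql`, `insertAll`) are invented for this example, and the table and column names are concatenated as-is, so they must come from trusted code and never from user input.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: one INSERT statement that carries many rows at once. */
final class MultiRowInsert {

    /** Builds e.g. "INSERT INTO t (a, b) VALUES (?, ?), (?, ?)" for two rows. */
    static String sql(String table, String[] columns, int rows) {
        if (rows < 1 || columns.length < 1) {
            throw new IllegalArgumentException("need at least one row and one column");
        }
        // one "(?, ?, ...)" group per row, with one placeholder per column
        String group = "(" + String.join(", ", Collections.nCopies(columns.length, "?")) + ")";
        return "INSERT INTO " + table + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rows, group));
    }

    /** Inserts all values into a single-column table with one round trip. */
    static void insertAll(Connection con, String table, String column, List<String> values)
            throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(sql(table, new String[] {column}, values.size()))) {
            for (int i = 0; i < values.size(); i++) {
                ps.setString(i + 1, values.get(i)); // JDBC parameters are 1-based
            }
            ps.executeUpdate();
        }
    }
}
```

With the Hello World table above, `insertAll(con, "table1", "user", names)` would replace one `executeUpdate` call per name with a single call. How many rows per statement gives the best speed is something to measure rather than assume.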

New name

One of my new year resolutions is to make changes where changes were due.

One of those changes is my online identity. For years now, my first name has been perfectly comprehensible in Portuguese and Spanish but has always caused some confusion in other languages. Too often my (already short) four-letter first name would get mangled into different variations. Other times I would be left explaining the origin of the name. It doesn't help one bit that most of my time is spent traveling and living in non-Latin lands.

At the end of the day, it isn't something that I see value in continuing, because my own preference for a first name is different. I prefer an international name that works well in any location. So, name changed. This transition should only be completed around 2020. In the meantime, please accept my apologies in advance for any confusion this might bring until the dust settles.

If you met me through the old name, please feel free to keep using it if you really prefer. From here forward, you might find my name written simply as Max.

Compiling Java apps as native x86 Windows executables

This was quite a holy grail some years ago.

The flexibility of Java, turned into an easy-to-run application for Windows. If you are interested in compiling your Java-based software into native Windows applications, then you have probably looked into options such as GCJ and wrappers. The problem is that GCJ is more of a myth nowadays, since it is not clear whether it still works, and wrappers, well, they are wrappers.

There is an alternative. A way to compile Java code directly into x86 binaries that run under Windows without needing .NET or any other intermediate platform. Welcome to the world of Mono and IKVM.

You should read this page:

The first step is compiling your jar file into a .NET executable using IKVMC, this is as simple as downloading the most recent tools from:

Attention: these tools only run on Windows 7 and above (you can likely run in XP by installing a modern .NET version).

Then place your jar file inside the .\bin folder. Open a command prompt in that folder and type:
ikvmc myApp.jar

Where "myApp.jar" is obviously the jar you want to transform. If you have additional jars to add to the classpath, look at the syntax page for instructions. In my case I already ship a "fat jar" where all the other jar files are merged into a single one anyway.

The compilation will likely report that some classes are missing. These are mostly references in the code to things that were not found. In my case none of them were used, so this was not really a problem: they were just warnings, and you still get a Windows .NET executable as output.

When running for the first time it failed, complaining that a method was returning a null value. There is something to know about IKVM: it is an implementation of the JVM but it is not perfect. So you might need to adapt the code slightly so that it doesn't touch the methods not supported by IKVM. In this case, I just removed the part of the code that was failing since it was not critical, and then everything worked.

Now you have a .NET executable.

You will need to ship it together with some of the IKVM dll files. The minimum files to include are IKVM.OpenJDK.Core.dll and IKVM.Runtime.dll. However, you might need additional DLL files from the IKVM distribution depending on the functionalities that you are using. This is mostly a trial and error approach until you find which files are really needed. The end result is that you get an executable that you won't need to ship with a JVM or ask end-users to install one (under Windows).

The final step, in case you are interested, is to remove the .NET dependency. This might be useful for the cases where you want to run on bare-metal Windows, from older versions such as Windows 95 all the way up to newer versions where x86 code is supported. That is where Mono helps. Take a look at this section:

Taking the executable created in the previous steps, install Mono (about 1Gb of disk space after installation) and type:
mono --aot myApp.exe

This will generate a set of files, refer to the previous link for additional details. Hopefully this makes your app run completely native under Windows. I have yet to test the performance of native x86 Java binaries against the bytecode versions running on the JVM. My initial expectation is that the Oracle/OpenJDK JVM is faster, but this might be a false impression. If you manage to test the performance, just let me know.

Have fun! :-)

The value of privacy

At TripleCheck things are never stable nor pretty, same old news. However, data archival and search algorithms kept booming beyond expectations, both of them estimated to grow 10x over the next 12 months as we finally add up more computing power and storage. So, can't complain about that.

What I do complain about today is privacy. More specifically, the value of anonymity. One of the business models we envisioned for this technology is the capacity to break source code apart and discover who really wrote each part of it. We see it as a wonderful tool for plagiarism detection, but one of the remote scenarios we also envisioned was uncovering the identity of hidden authors of malware.

Malware authorship detection had been a hypothesis. It made good sense to help catch these malware authors, the "bad guys", and bring them to justice when they commit cyber-warfare. There hadn't been many chances to test this kind of power over the past year because we were frankly too busy with other topics.

But today that changed. I was reading the news about an IoT malware spreading in the wild, whose source code got published in order to maximize damage:

For those who don't understand why someone would make malware code public in this context: it is because "script kiddies" will take that code, make changes and amplify its damaging reach. The author was anonymous, and nowadays it seems easy to just blame the Russians for every hack out there. So I said to myself: "let's see if we can find who really wrote the code".

I downloaded the source code for the original Mirai malware, which is available from:

I scanned the source code with the tool and started seeing plagiarism matches on the terminal.

What I wasn't expecting was such a clear list. In less than 10 minutes I had already narrowed the matches down to a single person on the Internet. For a start, he surely wasn't Russian. I took the time to go deeper and see what he had been doing in previous years, his previous projects and areas of interest. My impression is that he might feel disgruntled with "the system", specifically about the lack of security and privacy that exists nowadays, and that this malware was his way of demonstrating to the public that IoT can be too easily exploited and that this urgently needs to change.

And then I was sad.

This didn't look like a "bad guy", he wasn't doing it for profit. This was a plain engineer like me. I could read his posts and see what he wrote about this lack of device security, to no avail. Nobody listened. Only when something bad happens do people listen. I myself couldn't have cared less about IoT malware until this exploit was out in the wild, so what he did worked.

If his identity were now revealed, this might mean legal repercussions for an action that in essence is today forcing manufacturers to fix their known security holes (they wouldn't fix them otherwise because it costs them extra money per device).

Can we really accept a future where talking gets nothing done and only an exploit forces these fixes to happen?

I don't know the answer. All I know is that an engineer with possibly good intentions released the source code to get a serious security hole fixed before it would grow bigger (the number of IoT devices grows every year). That person published the code under the presumption of anonymity, which our tech is now able to uncover, possibly bringing damage to a likely good person and engineer.

TripleCheck GmbH

The news is official: TripleCheck is now a German company in full right.

It had already been a registered company since 2013. But it was labelled as a "hobby" company because of the unfortunate UG (haftungsbeschränkt) tag for young companies in Germany.

You might be wondering at this point: "what is all the fuss about?"

In Germany, a normal company needs to be created with a minimum of 25 000 EUR in the company bank account. When you don't have that much money, you can start a company with a minimum of one Euro, or as much as you can, but you get labelled as UG rather than GmbH.

When we first started the company, I couldn't have cared less about the UG vs. GmbH discussion. It was only when we started interacting with potential customers in Germany that we understood the problem. Due to this UG label, your company is seen as unreliable. It soon became frequent to hear "I'm not sure if you will be around in 12 months", and this came with other implications, such as banks refusing to grant us a credit card linked to the company account.

Some people joke that UG stands for "Untergrund", in the sense that this type of company has strong odds of not staying afloat and going under the ground within some months. Sadly true.

You see, in theory a company can save money in the bank account until it reaches 25k EUR and then upgrade. In practice, we can make money but at the same time have servers, salaries and other heavy costs to pay. Moving the needle to 25k is quite a pain. Regardless of how many thousands of Euros are made and then spent across the year, that yearly flow of revenue does not count unless you have a screenshot showing the bank account above the magic 25k.

This month we finally broke that limitation.

No more excuses. We simply went way above that constraint and upgraded the company into a full GmbH. Ironically, this only happened when our team moved temporarily out of Germany to open a new office elsewhere in Europe.

Finally a GmbH. I have to say that Germany is in some instances very unfriendly to startups. The UG situation reduces the chances of a young startup competing at the same level as a GmbH company, even when the technology of the UG company is notably more advanced.

As an example, in the United Kingdom you can get a Ltd. company and you are on the same standing as the large majority of companies. Ironically, we could have registered a Ltd. in the UK without money in the bank and then used this status in Germany to look "better" than a plain boring GmbH.

The second thing that bothers me is taxes. As a UG we pay the same level of taxes as a full GmbH, whereas in the UK you get tax breaks when starting an innovative company. In fact, in that country you get back 30% of your expenses with developers (any expense considered as R&D). From Germany we only got heavy tax bills to pay every month.

I'm happy that we are based in Germany. We struggled to survive and move up to GmbH, as you can see. We carved out our place in Darmstadt against the odds. But Germany, you are really losing your competitive edge when there are so many advantages for Europeans to open up startups in the UK rather than DE. Let's try to improve that, shall we? :-)

Intuitive design for command line switches

I use the command line.

It is easy. It gets things done in a straightforward manner. However, the design of command line switches is an everyday problem. Even for the tooling that is used most often, one is never able to fully memorize the switches required for common day-to-day actions.

For example, want to rsync something?
rsync -avvzP user@location1 ./

Want to decompress some .tar file?
tar -xvf something.tar

The above is already not friendly but still more or less doable with practice. But now ask yourself (attention: google not allowed):

How do you find files with a given extension inside a folder and sub-folders?

You see, in Unix (Linux, Mac, etc) this is NOT easy. Just like with so many other commands, a very common task was not designed with intuitive usage in mind. The commands work, but only as long as you learn an encyclopedia of switches. Sure, there are manual pages and google/stackoverflow to help, but what happened to simplicity in design?

In Windows/ReactOS one would type:
dir *.txt /s

In Unix/Linux this is a top answer:
find ./ -type f -name "*.txt"
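
For comparison, here is roughly what both commands do when spelled out in code: walk the folder tree and keep the regular files whose name ends with the extension. This is only a sketch, the class name `FindByExtension` is made up for the example and it is not how `dir` or `find` are implemented.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: list files with a given extension in a folder and its sub-folders. */
final class FindByExtension {

    static List<Path> find(Path root, String extension) throws IOException {
        String suffix = "." + extension;
        // Files.walk visits root and everything below it; the stream must be closed
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(suffix))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // defaults mirror the examples above: current folder, txt files
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String extension = args.length > 1 ? args[1] : "txt";
        find(root, extension).forEach(System.out::println);
    }
}
```

The point of the sketch is the defaults in `main`: the most common scenario (current folder, one extension) needs no switches at all, and arguments are only required when leaving that path.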


Great. This kind of complication is spread everywhere in everyday tasks. Want to install something? You need to type:
apt-get install something

Since we only use apt-get for installing stuff, why not?
apt-get something

Whenever you design your next command line app, it would help end-users if you list which usage scenarios will be most popular and reduce the switches to the bare minimum. Some people argue that switches deliver consistency, and this is a fact. However, one should perhaps balance consistency with friendliness, which in the end turns end-users into happy users.

Be nice, keep it simple.


What is the weirdest Unix command that really upsets you?

Noteworthy reactions to this post:

- Defending Unix against simpler commands:
- This post ended up stirring a "Linux > Windows" fight

How big was triplecheck data in 2015?

Last year was the first time that we released the offline edition of our tooling for evaluating software originality. At the beginning this tool was released on a single terabyte USB drive. However, shipping the USB drive by normal post was difficult and the speed at which we could read data from the disk peaked at 90Mbps, which in turn made scans take way too long (defined as anything running longer than 24 hours).

As the year moved along, we kept reducing the disk space required for fingerprints and at the same time kept increasing the total number of fingerprints dispatched with each new edition.

In the end we managed to fit the basic originality data sets inside a special 240Gb USB thumb drive. By "special" I mean a drive containing two miniature SSD devices connected in hardware-based RAID 0 mode. For those unfamiliar with RAID, this means two disks working together and appearing on the surface as a single disk. The added advantage is reading data faster, because you are physically reading from two disks and roughly doubling the speed. Since the drive has no moving parts either, the speed of the whole thing jumped to 300Mbps. My impression is that we haven't yet reached the peak speed at which data can be read from the device; our bottleneck simply moved to the CPU cores/software not being able to digest data faster. For contractual reasons I can't mention the thumb drive model, but this is a device in the range of $500 to $900. Certainly worth the price when scanning gets completed faster.

Another multiplier for speed and data size was compression. Tests were made to find a compression algorithm that wouldn't need much CPU to decompress and at the same time would reduce disk space. We settled for plain zip compression since it consumed minimal CPU and resulted in a good-enough ratio of 5:1, meaning that if something was using 5Gb before, now it was only using 1Gb of disk space.

There is an added advantage to this technique besides disk space: now we were able to read the same data almost 5x faster than before. If before we needed to read 5Gb from the disk, now this requirement got reduced to 1Gb for accessing the same data (discounting CPU load). It then became possible to fit 1Tb of data inside a 240Gb drive, reducing the needed disk space by 4x while increasing speed by 3x with the same data.
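
The arithmetic behind these figures can be sketched as follows. The class `CompressionRatio` and its method names are invented for this illustration; `java.util.zip.Deflater` implements the same deflate algorithm that plain zip files use, and the 255 Gb and 48 Gb inputs are the snippet sizes reported in the raw numbers of this post.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

/** Hypothetical sketch: measure a compression ratio and what it means for read speed. */
final class CompressionRatio {

    /** Compresses a buffer with deflate, the algorithm behind plain zip files. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int written = deflater.deflate(buffer);
            out.write(buffer, 0, written);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** A ratio of 5.0 means 5 units of data fit in 1 unit of disk space. */
    static double ratio(long originalSize, long compressedSize) {
        return (double) originalSize / compressedSize;
    }

    /** Upper bound on the logical read speed: ignores the CPU time spent decompressing. */
    static double effectiveThroughput(double diskThroughput, double ratio) {
        return diskThroughput * ratio;
    }

    public static void main(String[] args) {
        double snippets = ratio(255, 48); // 255 Gb of snippets stored in 48 Gb
        System.out.printf("snippet ratio: %.2f:1%n", snippets);
        System.out.printf("best-case speed-up at that ratio: %.1fx%n",
                effectiveThroughput(1.0, snippets));
    }
}
```

The gap between the best-case multiplier and the 3x observed in practice is the decompression cost that the CPU has to pay, which matches the observation that the bottleneck moved from the disk to the CPU cores.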

All this comes to the question: How big was triplecheck data last year?

These are the raw numbers:
     source files: 519,276,706
     artwork files: 157,988,763
     other files: 326,038,826
     total files: 1,003,304,295
     snippet files: 149,843,377
     snippets: 774,544,948
         jsp: 892,761
         cpp: 161,198,956
         ctp: 19,708
         ino: 41,808
         c: 54,797,323
         nxc: 324
         hh: 20,261
         tcc: 27,974
         j: 2,190
         hxx: 446,002
         rpy: 2,457
         cu: 17,757
         inl: 337,850
         cs: 26,457,501
         jav: 1,780
         cxx: 548,553
         py: 189,340,451
         php: 229,098,401
         java: 94,896,020
         hpp: 6,481,794
         cc: 9,915,077
     snippet size real: 255 Gb
     snippet size compressed: 48 Gb

One billion individual fingerprints for binary files were included. About 500 million (50%) of these fingerprints are source code files in 54 different programming languages. Around 15% of these fingerprints are related to artwork, meaning icons, png and jpg files. The other files are the ones usually included with software projects, things like .txt documents and such.

Over the year we kept adding snippet detection capabilities for mainstream programming languages. This means the majority of C-based dialects, Java, Python and PHP. On the portable offline edition we were unable to include the full C collection: it was simply too big and there wasn't much demand from customers to have it included (with only one notable customer exception across the year). In terms of qualified individual snippets, we are tracking a total of over 700 million across 150 million source code files. A qualified snippet is one that contains enough valid logical instructions. We use a metric called "diversity", meaning that a snippet is only accepted when a given percentage of it consists of logical commands. For example, a long switch or IF statement without other relevant code is simply ignored because this is not typically relevant from an originality point of view.
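
To give an idea of how such a filter could work, here is a deliberately naive sketch. It is not the TripleCheck implementation: the rule for what counts as a "logical command" (any non-blank line that is not branching scaffolding or a lone brace), the class name `SnippetDiversity` and the threshold parameter are all assumptions made for this example.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of a "diversity" filter; the scaffolding rule is an assumption. */
final class SnippetDiversity {

    // Assumption: branch keywords and lone braces are scaffolding, not logical commands.
    private static final Pattern SCAFFOLD = Pattern.compile(
            "^[{}\\s]*(if|else|switch|case|default|break)\\b.*$|^[{}\\s]+$");

    /** Fraction of non-blank lines that are not scaffolding (0.0 for an empty snippet). */
    static double diversity(String snippet) {
        int total = 0;
        int logical = 0;
        for (String raw : snippet.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines say nothing about originality
            }
            total++;
            if (!SCAFFOLD.matcher(line).matches()) {
                logical++;
            }
        }
        return total == 0 ? 0.0 : (double) logical / total;
    }

    /** A snippet qualifies when its diversity reaches the chosen threshold. */
    static boolean qualifies(String snippet, double threshold) {
        return diversity(snippet) >= threshold;
    }
}
```

With this rule, a snippet made only of `switch`, `case` and `default` lines scores zero and is dropped, while a short loop that accumulates and returns a value passes a threshold of 0.5. A real filter would work on parsed tokens rather than on the first word of each line.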

The body of data was built from relevant source code repositories available to the public and a selection of websites such as fora, mailing lists and social networks. We are picky about which files to include in the offline edition and only accept around 300 specific types of files. The raw data collected during 2015 went above 3 trillion binary files, and much effort was applied to iterate over this archive within weeks instead of months to build relevant fingerprint indexes.

For 2016 the challenge continues. There is a data explosion ongoing. We noticed a 200% growth between 2014 and 2015, although this might be because our own techniques for gathering data have improved and we are no longer limited by disk space as when we first started in 2014. More interesting is remembering that the NIST fingerprint index had a relevant compendium of 20 million fingerprints in 2011 and that we now need technology to handle 50x as much data.

So let's see. This year I think we'll be using the newer models with 512Gb. A big question mark is whether we can somehow squeeze out more performance by using the built-in GPU found in modern computers. This is new territory for our context, though, and there is no certainty that moving data between disk, CPU and GPU will bring added performance or be worth the investment. The computation is already light as it is, and not particularly suited (IMHO) to GPU-style processing.

The other field to explore is image recognition. We have one of the biggest archives of the miniature artwork (icons and such) that you find applied in software. There are cases where the same icon is saved under different formats, and right now we are not detecting such cases. My doubt is whether we should pursue this kind of detection because it is a necessary thing (I have no doubt that it is a cool thing, though). What I'm sure of is that we have already doubled the archive compared to last year and that soon we'll be creating new fingerprint indexes. Again the optimization starts to keep speed acceptable. Oh well, data everywhere. :-)


The last twelve months did not pass fast. It was a long year.

Family-wise, things changed. Some relatives were affected by Alzheimer's, others show old age all too early. My own mom had surgery for two different cancer cases, plus a foot surgery. My 80-year-old grandma broke a leg, which is a problem at her age. In worse cases, family members passed away. Too much, too often, too quick. I've tried to be present, to support the treatment expenses and somehow, just somehow, help.

Sadder still was the death of my wife's father. It happened overnight just before Christmas, too quick and unexpected. Also sad was the earlier death of our dog, who had been part of the family for a whopping 17 years. It was sad to see our old dog put to his final sleep. He was in constant pain and couldn't even walk any more. I will miss our daily walks in the park, which happened three times a day regardless of snow or summer. We'd just get out on the street for fresh air so he could do his own business. Many times I enjoyed the sun outside the office thanks to him, and I am still grateful for these good moments.

A great moment in 2015 was the birth of my second son. A strong and healthy boy. It brought nostalgia for the happy moment when my first child was born back in 2008. Since that year almost everything has changed, especially maturity-wise. In 2009 I made the world familiar to me fall apart, and to this day I feel sad about my own decisions that eventually broke the first marriage. I can't change the past, but I can learn, work and aim to become a better father for my children. This is what I mean by maturity: going that extra mile to balance family and professional activities. It is sometimes crazy, but somehow there must be balance.

This year we had the first proper family vacation in 5 years, which consisted of two weeks at a mountain lake. No phone, no Internet. I had to walk a kilometer to get some WiFi on the phone at night. This summer we were talking with investors and communication was crucial, so the idea of a vacation seemed crazy. In the end, family was given preference, and after summer we didn't go forward with the investors in any case. Quality time with family was what really mattered, lesson learned.

Tech-wise we did the impossible, repeatedly. If by December 2014 we had an archive with a trillion binary files and struggled hard with how to handle the data already gathered, by the end of 2015 we estimated that we had 3x as much data stored. Not only did the availability of open data grow exponentially, we also kept adding new sources of data before they would vanish. Where before we targeted some 30 types of source code files related to mainstream programming languages, now we target around 400 different types of binary formats. In fact, we don't even target just files. Today we see relevant data extracted from blogs, forum sites and mailing lists. I mention an estimate because only in February will we likely be able to pause and compute rigorous metrics. There was an informal challenge at DARPA to count the number of source code lines that are publicly available to humanity today, and we might be able to report back a 10^6 growth compared to an older census.

Having many files and handling that much data with very limited resources is one part of the equation that we (fortunately) had already solved back in 2014. The main challenge for 2015 was how to find the needles of relevant information inside a large haystack of public source code within a reasonable time. Even worse, how to enable end-users (customers) to find these needles by themselves inside the haystack from their laptops, offline, without a server farm somewhere (privacy).

However, we did manage to get the whole thing working. Fast. The critical test was a customer with over 10 million LOC in different languages, written over the past 15 years. We were in doubt about such a large code base. But running the triplecheck tooling on a normal i7 laptop to crunch the matches required only 4 days, compared to 11 days for other tools with smaller databases. That was a few months ago; in 2016 we are aiming to reduce this down to a single day of processing (or less). Impossible is only impossible until someone else makes it possible. Don't listen to naysayers, just take as many steps as you need to go up a mountain, no matter how big it might be.

Business-wise it was quite a ride. The worst decisions (mea culpa) in 2015 were pitching our company at a European-wide venture capital event and trusting completely in outsourced sales, without preparation for either. The first decision wasn't initially bad. We went there and got 7 investors interested in follow-up meetings. We were very honest about where the money would be used, along with expected growth. However, the cliché that engineers are not good at business might be accurate. Investors speak a different language and there was disappointment on both sides. This initiative cost our side thousands of euros in travel and material costs, along with 4 months of stalled development.

Worse was believing that the outsourced team could deliver sales (without asking them for a proof or test beforehand). Investors can invest without proof of revenue, but once someone goes to market they want to wait and see how it performs. In our case, it didn't perform. Many months later we had paid thousands of EUR to the outsourced company and had zero product revenue to show for it. I felt like a complete fool for permitting this to happen and not hitting the brakes earlier.

The only thing saving the company at this point was our business angel. Thanks to his support we kept getting new clients for the consulting activities. The majority of these clients became recurring M&A customers, and this is what kept the company afloat. I can never thank him enough, a true business angel in the literal sense of the expression.

By October, the dust from outsourcing and investors had settled. There was now certainty that we want to build a business and not a speculative startup. We finally got product sales moving forward by bringing aboard a veteran of this kind of challenge. For a start, no more giving our tools away for free during the trial phase. I was skeptical, but it worked well because it focused our attention on companies that would pay upfront for a pilot test. This made customers take the trial phase seriously, since it had a real cost paid by them. The second thing was to stop using PowerPoint during meetings. I would prepare slides before customer meetings, but this is counter-productive. More often than not, customers couldn't care less about what we do. Surprisingly enough, they care about what they do and how to get their own problems solved. :-) Today the focus is on listening more than speaking at such meetings. Those two simple changes made quite a difference.

So, that's the recap from last year. Forward we move. :-)

Linux Mint 17 with Windows 10 look

This weekend I finally took the time to upgrade Windows 7 on my old laptop and try out that button on the system tray offering the free Windows 10 install.

I was surprised: this is an old laptop from 2009 that came with the stock Windows 7 version and still works fairly OK. I have to say that the new interface is indeed looking better and simpler. The desktop is enjoyable, but the fact that this Windows version beams up to Microsoft whatever I'm doing on my own laptop is still a bother and a cold chill down the spine.

On my newer laptop I run Linux Mint. This is an old version installed back in 2013 and could really use an update. So, since it was upgrade-weekend I've decided to simply go ahead and bring up this Linux machine to a more recent version of Mint and see what had changed over the past years. While doing this upgrade, a question popped up: "how about adding the design of Windows 10 with Linux underneath, would it work?"

And this is the result:

The intention wasn't to create a perfect look-alike, but (in my opinion) to mix things up and get a relatively fresh-looking design based on Windows, without giving up our privacy.

Operating System

I got Linux Mint 17.2 (codename Rafaela, Cinnamon edition for x64), downloaded from http://www.linuxmint...tion.php?id=197

Instead of installing to disk, this time I've installed the operating system on a MicroSD card and now run it from there, connected to the laptop through the SD reader using an SD adapter. The MicroSD is a Samsung 64 GB with an advertised read speed of 40 MB/s. Cost was ~30 EUR.

Installing the operating system followed the routine steps one would expect. There is a GUI tool within Linux Mint to write the DVD ISO onto a USB stick connected to your laptop. Then boot from the USB stick and install the operating system on the MicroSD; the boot entry is added automatically.

Windows 10 theme and icons

Now that the new operating system is running, we can start the customization.

The window style you see in the screenshot can be downloaded from: http://gnome-look.or...?content=171327

This theme comes with icons that look exactly like Windows 10, but that didn't look balanced, nor was it our intention to copy the icons pixel by pixel. Rather, the intention was to re-use the design guidelines. While looking for options, I found Sigma Metro, which resembled what was needed: http://gnome-look.or...?content=167327

If you look around the web, you'll find instructions on how to change the window themes and icons. Otherwise if you get into difficulties, just write me a message and I'll help.

Firefox update and customization

Install Ubuntu Tweak. From there, go to the Apps tab and install the most recent edition of Firefox, because the one included in the distro is a bit old.

Start changing Firefox by opening it and going to "Addons" -> "Get Addons". Type "Simple White Compact" in the search box. This was the simplest theme I found and it changes the browser's looks, from icons to tab position, as you can see in the screenshot. Other extensions that you might enjoy adding while making these changes are "Adblock Plus" to remove ads, "Tab Scope" to show thumbnails when browsing tabs and "Youtube ALL HTML5" to force YouTube to run without the Adobe Flash Player.

Office alternative and customization

Then we arrive at Office. I only keep that oldish laptop because it has Adobe Reader (which I use for signing PDF documents) and Microsoft Office, for the cases when I need to modify documents and presentations without getting them to look broken. So this time I was prepared to run both apps using Wine (it is possible), but decided to first check on the alternatives and try using only native Linux apps. I was pleasantly surprised.

LibreOffice 4.x is included by default in the distro. Whenever I used it, my slides formatted in MS Office would look broken and unusable. I decided to download and try out version 5.x and, to my surprise, noticed that these issues are gone. Both the slides and Word documents are now properly displayed, with just about the same results that I'd expect from Microsoft Office. I'm happy.

To install LibreOffice 5.x visit https://www.libreoff...reoffice-fresh/

For the Linux edition, read the text document with instructions. It is quite straightforward, just one command line to launch the setup. So I was happy with LibreOffice as a complete replacement for Microsoft Office (no need to acquire licenses nor run Office through Wine). However, the icons inside LibreOffice still didn't look good, they looked old. In this respect the most recent version of Microsoft Office simply "looks" better. I wanted LibreOffice to look that way too, so I got icons from here: http://gnome-look.or...?content=167958

It wasn't straightforward to find out where the icons should be placed, because the instructions for version 4.x no longer apply. To help you, the zip file with icons needs to be placed inside:

Then you can open "Writer" and under "Tools" -> "Options" -> "View" choose "Office2013" to get the new icons in use. The startup logo of LibreOffice also seemed too flashy and could be changed, so I replaced it with the one available at http://gnome-look.or...?content=166590

Just a matter of overwriting the intro.png image found at:

Alternative to Adobe Reader for signing PDF

Every now and then comes a PDF that needs to be printed, signed by pen and then scanned to send back to the other person. I stopped doing this kind of thing some time ago by adding a digital signature that includes an image of my handwritten signature to the document. This way there's no need to print or scan any paper. Adobe Reader did a good job at this task, but getting it to run on Wine with the signature function was not straightforward.

I started looking for a native Linux alternative and found "Master PDF Editor". The code for this software is not public, but it was the only option I could find that provided a native Linux install supporting digital handwritten signatures: https://code-industr...asterpdfeditor/

If you're using this tool for business, you need to acquire a license. For home use it is free of cost. Head over to the download page and install the app. I was surprised because it looked very modern, simple and customizable. I'll buy a license for this tool, it does exactly what I needed. With LibreOffice and Master PDF Editor as a complete alternative to MS Office and Acrobat, there is no longer a valid reason (in my case) to switch back to the old laptop whenever editing documents. This can now be done with the same (or even better) quality from Linux.

Command line

A relevant part of my day-to-day involves the command line. In Linux this is a relatively pleasant task because the terminal window can be adjusted and customized, and never feels like a second-class citizen on the desktop. With the changes applied so far, it was now possible to further improve the terminal window by showing the tool bar (see the screenshot).

Open a terminal, click on "View" -> "Show tool bar". Usually I'm against adding buttons, but that tool bar has a button for pasting clipboard text directly into the console. I know that this can be done from the keyboard using "Ctrl + Shift + V", but I found it very practical to just click a single button and paste the text.

Non-Windows tweaks

There are tweaks only possible on Linux. One of my favorites is still "Wobbly windows". Enable Compiz on the default desktop environment:

With Compiz there are many tweaks possible, I've kept them to a minimum but certainly is refreshing to use some animations rather than the plain window frames. If you never saw this in action, here is a video example:

Skype alternatives

Many of my friends and business contacts use Skype. It is not safe, it is not private, and I'd prefer to use a non-Microsoft service because the Skype client gets installed on my desktop. Who knows what it can do on my machine when it is running in the background. One interesting alternative that I've found was launching the web edition of Skype that you find at

In Firefox, there is the option to "Pin" a given tab. So I've pinned Skype as you can see in the screenshot, and now it opens automatically whenever the browser is opened, in practice bringing it online when I want to be reachable. A safe desktop client would be a better alternative; this is nowhere near a perfect solution, but rather a compromise that avoids installing the Skype client.


There are more small tweaks happening to adjust the desktop for my case, but what is described above are the big blocks to help you reach this kind of design in case you'd like to do something similar. If you have any questions or get stuck at any part of customization, just let me know.

Have fun!

.ABOUT format to document third-party software

If you are a software developer, you know that every now and then someone asks you to create a list of the third-party things that you are using on some project.

This is a boring task. Ask any motivated developer and try to find a single one who will not roll their eyes whenever asked to do this kind of thing. We (engineers) don't like it, yet are doomed to get this question every now and then. It is not productive to repeat the same thing over and over again, so why can't someone make it simpler?

Waiting a couple of years didn't work, so it's time to roll up the sleeves and find an easier way of getting this sorted. To date, one needs to manually list each and every portion of code that is not original (e.g. libraries, icons, translations, etc.) and this will end up in either a text file or a spreadsheet (pick your poison).

There are ways to manage dependencies. Think of npm, maven and similar. However, you need to be using a dependency manager and this doesn't solve the case of non-code items. For example, when you want to list that package of icons from someone else, or just list dependencies that are part of the project, but not really part of the source code (e.g. servers, firewalls, etc).

For these cases, you still need to do things manually and it is painful. At TripleCheck, we don't like doing these lists ourselves, so we started looking into how to automate this step once and for all. Our requirements: 1) simple, 2) tool-agnostic and 3) portable.

So we leaned towards the way configuration files work, because they are plain text files that are easy for humans to read or edit, and straightforward for machines to parse. We are big fans of SPDX because it permits describing third-party items in intricate detail, but a drawback of being so detailed is that sometimes we only have coarse-grained information. For example, we know that the files in a given folder belong to some person and have a specific license (maybe we even know the version), but we don't want to compute the SHA1 signature for each and every file in that folder (either because the files might change often, or simply because the engineer can't do it easily and quickly).

Turns out we were not alone in this kind of quest. NexB had already pioneered, in previous years, a text format specifically for this kind of task, defining the ".ABOUT" file extension to describe third-party copyrights and applicable licenses:

The text format is fairly simple, here is an example we use ourselves:
name: jsTree
license_spdx: MIT
copyright: Ivan Bozhanov
version: 3.0.9

spec_version: 1.0
download_url: none

# when was this ABOUT file created or last updated?
date: 2015-09-14

# files inside this folder and sub-folders
about_resource: ./

Basically, it follows the SPDX license abbreviations to ensure we use a common way of talking about the same license, and you can add or omit information depending on what is available. Pay attention to the "about_resource" field, which describes what is covered by this ABOUT file. Using "./" means all files in this folder and in its sub-folders.

One interesting point is the possibility of nesting multiple ABOUT files. For example, place one ABOUT file at the root of your project to describe the license terms generally applicable to the project, and then create specific ABOUT files for specific third-party libraries/items to describe what is applicable in those cases.

When done with the text file, place it in the same folder as what you want to cover. The "about_resource" field can also be used for a single file, or repeated on several lines to cover a very specific set of files.
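To give an idea of how little is needed to read such a file, here is a minimal sketch in Python. It is not the NexB tooling nor our own engine, just an illustration that assumes the simple "key: value" layout of the example above, with "#" comment lines and possibly repeated keys. The function name parse_about is made up for this example.

```python
def parse_about(text):
    """Parse the 'key: value' lines of an ABOUT file into a dict.

    Every key maps to a list, so repeated keys (for example several
    about_resource lines) are all kept.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines starting with '#'
        if not line or line.startswith("#"):
            continue
        # Split on the first ':' only, so values may contain colons (URLs)
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line, ignore it
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


sample = """\
name: jsTree
license_spdx: MIT
copyright: Ivan Bozhanov
version: 3.0.9

# files inside this folder and sub-folders
about_resource: ./
"""

info = parse_about(sample)
print(info["name"][0], info["license_spdx"][0], info["about_resource"])
```

Keeping every value in a list is a deliberate simplification, so that a file with several "about_resource" lines does not silently lose entries.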

NexB made tooling available to collect ABOUT files and generate documentation. Unfortunately, this text format is not as well known as it should be. Still, it fits like a glove as an easy solution to list third-party software, so we started using it for automating the code detection.

Our own TripleCheck engine is now supporting the recognition of .ABOUT files and adding this information automatically to the report generation. There is even a simple web frontend for creating .ABOUT files at

From that page, you can either create your own .ABOUT files or simply browse through the collection of already created files. The backend of that web page is powered by GitHub, you find the repository at

So, no more excuses to keep listing third-party software manually on spreadsheets.

Have fun! :-)

Something is cooking in Portugal

I don't usually write about politics, for me that is more often a never-ending discussion about tastes, rather than facts.

However, one senses a disturbance in the force in Portugal. For the first time in the last (35?) years we see a change in the landscape. For those not familiar with Portuguese politics, the country is historically ruled by one of the two large parties. Basically, one "misbehaves" and then comes the other to "repair". Vice versa at the next elections, as voters grow apathetic and disconnected from whoever gets elected.

This year wasn't the case. The ruling party is seen as "misbehaving" and the other party didn't get a majority; in other words, it didn't convince a significant part of the population to vote for it. This isn't unusual. What was different was the large number of votes going to two other minor parties, and the fact that most citizens got up from their sofas to vote on who "rules" them for the next years.

For the first time, I'm watching how the second largest party is now forced to negotiate with these smaller parties to reach an agreement. For the first time in a long while, they have to review what was promised during election time and get audited by other parties to ensure they keep those promises.

In other words, for the first time I'm watching what I'd describe as a realistic democratic process happening in our corner of Europe. These might seem strong words, but the fact is that ruling a government by majority (in our context) is a carte blanche to rule over public interests. Go to Portugal and ask people if they feel the government works on their behalf or against them. Ask them for specific examples from recent years that support their claim, and they quickly remember epic fights to prevent expensive airports from being built (Ota) by the government, or the extensive (and expensive) network of highways that got built with EU money and are today empty, still serving only the private interest of the companies charging tolls on them.

There was (and still is) too high a level of corruption in the higher instances of government (just look at our former prime minister, recently in jail, or the current prime minister: ask him about "tecnoforma" or about his friend "Dr. Relvas"), so there is a positive impact when small parties get higher voting representation, forcing majority administrations to be audited and checked in public.

You see, most of this situation derives from a control of mind-share. In previous centuries you'd get support from local cities by promoting your party followers to administrative positions. Later came newspapers (which got tightly controlled), then radio (eventually regulated to forbid rogue senders), then TV (which to date has only two private channels and two state-owned channels) and now comes the Internet.

With the Internet there is a problem. The local government parties with majority do not control the platforms where people exchange their thoughts. The Portuguese use Facebook (love it or hate it, that's what ordinary families and friends use among themselves) and Facebook couldn't (currently) care less about elections in Portugal, nor would either of the large parties have the resources to make Facebook biased towards their interests. So what we have is a large platform where public news can be debunked as false or plainly biased, where you can see how other citizens really feel about the current state of affairs, and where smaller parties get a fair chance to be read, heard and now even voted for by people who support what they stand for.

For the first time I see the Internet making a real difference in enabling people to be connected among themselves and enabling the population to collectively learn and change the course of their history, together. As for the Portuguese, you see the big parties worried that re-election on autopilot is no longer assured. They too need to work together now. Portuguese, please do keep voting. For me this is democracy in action. Today I'm happy.

TripleCheck as a Top 20 Frankfurt startup to watch in 2015

Quite an honor and a surprise: we were awarded this distinction despite the fact that we don't see ourselves so much as a startup, but rather as a plain normal company worried about getting to the next month and growing with its own resources.

Looking back, things are much better off today than a year ago. Our schedule is busy at 150% of client allocation and we managed to survive through plain normal consulting, finally moving to product sales this year with a good market reception so far. Team grew, we finally have a normal office location and I keep worrying each month that the funds in the bank are not enough to cover expenses. Somehow, on that brink of failure or success we work hard to pay the bills and invest in material or people that permits moving a bit further each month.

It is not easy, this is not your dream story and we don't know what will happen next year. What I know is that we are pushed to learn more and grow. That kind of experience has a value of its own.

Next step for triplecheck is building in 2015 our own petabyte-level datacenter in Frankfurt. Efficiency of costs aside, we are building a safe-house outside of the "clouds" where nobody really knows who has access to them.

I wish it were time for vacations or celebrations, but this is not yet the time. I'm happy that together with smart and competent people we are building a stable company.

List of >230 file extensions in plain JSON format

I've collected over the last year some 230 file extensions and manually curated their descriptions so that whenever I find a file extension, it becomes possible to give the end-user a slight idea about what the extension is about.

Most of my code nowadays is written in Java but there is interest in porting some of this information to web apps. So I have exported a JSON list that you are welcome to download and use in your projects.

The list is available on GitHub at this link.

One thing to keep in mind is that I'm looking at extensions from a software developer perspective. This means that when the same extension is used for different programs, I usually favor the programs related to programming.

The second thing is that I collect more information about file extensions than the info you find on this JSON list. For example, I populate for each extension the applicable programming languages. Here is an example for .h source code files. Other values include information if the data is plain binary or text readable, the category to which the extension belongs (archive, font, image, sourcecode, ..) and other meta data values that are useful for file filtering and processing.
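As a rough idea of how such a list could be used from another language, here is a small Python sketch that looks up the description for a file name. Note the assumptions: the field names "extension" and "description" and the two sample entries are made up for this illustration, so check the actual JSON file for the real structure before using it.

```python
import json

# Hypothetical excerpt: field names and descriptions are illustrative only,
# the published list may be structured differently.
SAMPLE_JSON = """
[
  {"extension": "h", "description": "C/C++ header file"},
  {"extension": "jar", "description": "Java archive"}
]
"""


def build_index(json_text):
    """Map each lower-cased extension (without the dot) to its description."""
    return {e["extension"].lower(): e["description"] for e in json.loads(json_text)}


def describe(filename, index):
    """Return the description for a file name, or None when unknown."""
    _, dot, ext = filename.rpartition(".")
    if not dot:
        return None  # the file name has no extension at all
    return index.get(ext.lower())


index = build_index(SAMPLE_JSON)
print(describe("stdio.H", index))
print(describe("README", index))
```

Building the dictionary once and lower-casing the extension keeps each lookup cheap and makes "stdio.H" and "stdio.h" behave the same way.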

If you need help or would like to suggest something to improve the list, just let me know.

Updating the header and footer on static web sites using Java

This year was the first time that I've moved away from websites based on Wordpress, PHP and MySQL to embrace the simplicity of static HTML sites.

Simplicity is indeed a good reason. It means virtually no exploits as there is no database nor script interpretation happening. It means speed since there are no PHP, Java nor Ruby scripts running on the server and only direct files are delivered. The last feature that I was curious to try is the site hosting provided by Github, which is only supporting static web sites.

The first site to convert was the TripleCheck company site. It had been developed over a year ago and was overdue for a serious update. It was based on Wordpress and it wasn't easy to make changes to the theme or content. The site was quickly converted and placed online using Github.

However, it's not all roses with static websites. As you can imagine, one of the troubles is updating the text and links that you want to see on each page of the site. There are tools such as Jekyll that help to maintain blogs, but all that was needed here was a simple tool that would pick the header and footer tags and update them with whatever content was intended.

Easy enough, I wrote a simple app for this purpose. You can download the binaries from this link and the source code is available at

How to get started?

Place the site_update.jar file inside the folder where your web pages are located. Then copy also the html-header.txt and html-footer.txt files and write inside the content you'd want to use as header and footer.

Inside the HTML pages that you want to change, you need to include the following tags:

Once you have this ready, from the command line run the jar file using:
java -jar site_update.jar

Check your HTML pages to see if the changes were applied.

What happens when it is running?

It will look for all HTML files with .html extension that are found on the same folder where the .jar file is located. For each HTML file it will look for the HTML tags that were mentioned above and replace whatever is placed between them, effectively updating your pages as needed.

There is an added feature. If you have pages on a sub-folder, this software will automatically convert the links inside the tags so that they keep working. For example, a link pointing to index.html will be modified to ../index.html and this way preserve the link structure. This is done also for images.
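The idea described above can be sketched in a few lines. This is not the code of site_update.jar (which is written in Java), just a Python illustration of the same technique. The marker comments used here are an assumption made for the example and the tags expected by the real tool may differ; the names relink and replace_block are also made up.

```python
import re

# Hypothetical marker comments; the tags expected by site_update.jar may differ.
START, END = "<!-- header:start -->", "<!-- header:end -->"


def relink(fragment, depth):
    """Prefix relative href/src targets with '../' once per folder level."""
    prefix = "../" * depth

    def fix(match):
        attr, target = match.group(1), match.group(2)
        # Leave absolute URLs, root-relative paths and anchors untouched
        if re.match(r"([a-zA-Z][a-zA-Z0-9+.-]*:|/|#)", target):
            return match.group(0)
        return f'{attr}="{prefix}{target}"'

    return re.sub(r'(href|src)="([^"]*)"', fix, fragment)


def replace_block(page, fragment, depth=0):
    """Replace whatever sits between START and END with the new fragment."""
    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)
    new_block = START + "\n" + relink(fragment, depth) + "\n" + END
    # A function is used as replacement so backslashes are not interpreted
    return pattern.sub(lambda _: new_block, page)


header = '<a href="index.html"><img src="logo.png"></a> <a href="http://example.com/">ext</a>'
page = "<body>\n" + START + "\nold header\n" + END + "\n<p>content</p></body>"

# depth=1 simulates a page that lives one sub-folder below the site root
updated = replace_block(page, header, depth=1)
print(updated)
```

The markers themselves are kept in the output, so the same page can be updated again the next time the header text changes.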

An example where this program is used can be found at the TripleCheck website, whose code you find available on Github at

Feedback, new features?

I'd be happy to help. Just let me know on the comment box here or write a post on Github.

List of 310 software licenses in JSON format

I recently needed a list of licenses to use inside a web page. The goal was to present the end-user with a set of software licenses to choose from. However, I couldn't find one readily available as JSON or some other format that could be embedded as part of Javascript code.

So I've created such a list, based on the nice SPDX documentation. This list contains 310 license variations and types. I'm explicitly mentioning "types" because you will find licenses called "Proprietary" to define some sort of terms that are customized and a "Public domain" type, which is not a license per se but in practice denotes the lack of an applicable license since copyright (in theory) is not considered as applicable for them.

In case you are ok with these nuances, you can download this json list from

The list was not crafted manually, I wrote a few lines of Java code to output the file. You find this file at

If you find the list useful and have feedback or need an updated version, just let me know.

SSDEEP in Java

If you are familiar with similarity hashing algorithms (a.k.a. fuzzy hash matching) and need an SSDEEP implementation in Java code, it is available directly from my Github account at this location:

The original page for SSDEEP can be found at

On that page you find also the binaries for Windows.

Have fun.