MySQL as an option for a large-scale file repository
Further improvements were made. There was interest in getting files onto the server so that later it would be possible to perform checksum analysis on them.
Initially I considered storing all files directly in the file system. Some time ago I had already tried to store a few hundred thousand files under the same folder, and the result was bad. After reading more about the matter, I found two alternatives: use a more suitable file system such as XFS, or go all the way with Hadoop on top of the existing file systems to compute hashes over the existing repository.
Both options looked good, but both proved disappointing at the implementation stage. Setting up XFS on a remote server would cause downtime and bring a risk of disk space shortage, as I would have to allocate a large portion of the existing EXT3 file system to it.
Running Hadoop seemed nice from all the papers and articles that I was reading, however, it would force all files to be placed inside a clustered file system that would still be out of reach or require specific interaction. At this point I was neither happy with XFS nor Hadoop.
So, since MySQL is performing so well, why not give it a go at database storage?
At first I would have been quick to claim that file systems are faster than databases any day of the week. However, what is the use of speed if you then lose time (and hair) traversing all the files?
I decided to give MySQL a chance and started uploading files directly to the database. The result: I am happy!
This way I am adding all the files to a normalized table and can run queries to select files added on a specific date, with a specific size or with a specific MIME type. On top of that, I don't have to deal with a special file system and can keep all the data together.
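As a rough sketch of what such a table and its queries could look like, the Python snippet below uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server. The table name `files`, its columns, the helpers `add_file` and `names`, and the sample rows are all made up for illustration and are not the schema actually used here; on MySQL the `data` column would typically be a BLOB type such as LONGBLOB.

```python
import sqlite3

# Assumption: sqlite3 stands in for MySQL so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE files (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        mime_type TEXT NOT NULL,
        size      INTEGER NOT NULL,
        added_on  TEXT NOT NULL,
        data      BLOB NOT NULL
    )
    """
)


def add_file(name, mime_type, added_on, data):
    """Store one file's bytes together with its metadata (hypothetical helper)."""
    conn.execute(
        "INSERT INTO files (name, mime_type, size, added_on, data) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, mime_type, len(data), added_on, data),
    )


# Made-up sample files; added_on is stored as an ISO date string.
add_file("readme.txt", "text/plain", "2012-05-01", b"hello world")
add_file("logo.png", "image/png", "2012-05-01", b"\x89PNG fake bytes")
add_file("notes.txt", "text/plain", "2012-05-02", b"x" * 5000)


def names(query, params=()):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(query, params)]


# Select by date, by MIME type, and by size.
by_date = names("SELECT name FROM files WHERE added_on = ? ORDER BY name", ("2012-05-01",))
by_mime = names("SELECT name FROM files WHERE mime_type = ? ORDER BY name", ("text/plain",))
by_size = names("SELECT name FROM files WHERE size > ? ORDER BY name", (1000,))
```

Keeping the metadata in ordinary columns next to the file contents is what makes these selections one-line queries instead of a walk over a directory tree.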
Not everything is perfect, for sure. One might complain about the need to extract files from MySQL to somewhere else in order to apply an algorithm. However, think of it this way: now we can also connect directly to the database, and it will provide results faster than a typical file system. On top of that, we don't have to manage two systems, just one that can one day be clustered on some cloud environment if desired.
There are tradeoffs for sure, but so far I am happy with this decision. If later we need a file system, we can still have one. If we want specific collections of files, it will be a breeze to put them together using MySQL.
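To illustrate the "connect directly to the database" idea for the checksum analysis mentioned earlier, here is a minimal sketch that hashes file contents straight from the table without writing them to disk first. It again uses sqlite3 as a stand-in for MySQL; the `files` table, the `checksums` function, the sample rows and the choice of SHA-256 are assumptions for the example, not the actual setup.

```python
import hashlib
import sqlite3

# Assumption: sqlite3 stands in for MySQL; table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, data BLOB)")
conn.executemany(
    "INSERT INTO files (name, data) VALUES (?, ?)",
    [("a.txt", b"hello"), ("b.txt", b"hello"), ("c.txt", b"world")],
)


def checksums(connection):
    """Return {file name: SHA-256 hex digest}, reading bytes from the database."""
    result = {}
    for name, data in connection.execute("SELECT name, data FROM files ORDER BY id"):
        result[name] = hashlib.sha256(data).hexdigest()
    return result


sums = checksums(conn)
# Files with identical content end up with identical digests.
duplicates = sorted(n for n in sums if list(sums.values()).count(sums[n]) > 1)
```

For very large files it may still be preferable to read the content in chunks or export it, since this sketch loads each file fully into memory.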
Posted by Max Brito