The hidden overhead when creating thousands of empty folders

Over the past weeks I've been looking for a way to store some hundred million files inside a desktop file system such as Ext3 or NTFS.

On previous experiments, I've noted that after >30 000 files on a single folder it would not be possible to use a GUI file browser and even the listing of these files using the command line becomes sluggish.

The next hypothesis was to avoid a huge number of files/folders on the same root. For example, for a file I'd remove the name portion and compute a SHA1 checksum signature of 40 bytes. With this signature I'd use the first 4 bytes to create a folder, then I'd use the next 4 bytes to create a sub-folder inside the first one and so forth until I have 10 folders that represent the unique signature of the file.

On the last folder I would add a text file containing the path and name information. If more files would be found in the future, the respective path information would be added on this text file. I thought that using the 4 bytes would reduce the number of possible combinations and permit re-use of the folders that would scale to millions.

In theory it worked OK. On my first tests it worked OK with the first thousand files. The problem was a bit more unexpected. Folders, despite containing nothing other than other folders will also require disk space to exist. While the initial project was a success, we could store/access thousands of files without noticeable latency, we brought aboard a problem of disk storage.

To index something like 40 000 files with this SHA1/folder combination was using 2.2Gb of disk space. This was a hidden overhead, very underestimated on my early estimation.

I didn't tested under the Windows NTFS, having this problem under Linux was already a no-go for this option. Fortunately, on the bench was already being planned an alternative.

Our file-system (at minimum) requires only to write files once and then read them when needed. So, a very simple solution would be writing all files together to create a very large file. And then have a second file indicating the position of each file within this large binary block.

There are some problems with this approach:
  • Easy to corrupt. If one byte within the large block is removed/added then it invalidates the index structure completely
  • Error checking overhead. Right now it copies files directly onto the big binary block (BBB). If an exception occurs while copying a file, this leads to corruption of the whole big file.
  • Limited to operating systems capable of handling files bigger than 4Gb
  • (..more exist, just pointing the ones most troublesome at the moment) 
Still, as a storing solution this is so far working as desired. Helped to remove the folder overhead and is extremely fast. I'm writing the index file as the coordinates of the file within the big file along with its original path/name combination. An example can be found at https://raw.githubusercontent.com/triplecheck/tdf/57a776d7ea065b686cb024f1dc924b5b06f8752d/run/test.tdf-index

As far as I'm concerned, creating millions of files on a desktop-grade file system is something I won't be pursuing further. It is tempting to have all the files ready for processing in their original format but I'm simply not finding a feasible manner of accessing millions of files in their original form without resorting to file systems such as XFS and HDFS, or resorting to some kind of data base.

So, opting with a big-binary file as simple storage for now.