In search of a remedy, I have experimented with OpenMP (Open Multi-Processing) and libarchive. The former is a simple API for shared-memory parallel programming in C; the latter is a library for reading and writing file archives such as zip and tar. OpenMP enables loading data and extracting features in parallel, while libarchive allows the data to be stored efficiently in compressed archives instead of directories.
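To make the OpenMP part concrete, here is a minimal sketch of a parallel extraction loop. It is not the code behind the measurements: `NUM_FEATURES`, `extract_features` and `extract_all` are hypothetical names, and hashing whitespace-separated tokens into a few bins is just a stand-in for a real feature extractor. The reports are assumed to be already loaded into memory as strings.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define NUM_FEATURES 8 /* hypothetical size of a feature vector */

/* Hypothetical extractor: hash each whitespace-separated token
 * into one of NUM_FEATURES bins and count the hits per bin. */
static void extract_features(const char *report, unsigned counts[NUM_FEATURES])
{
    const char *p = report;

    memset(counts, 0, NUM_FEATURES * sizeof counts[0]);
    while (*p) {
        /* Skip whitespace between tokens. */
        while (*p && isspace((unsigned char)*p))
            p++;
        if (!*p)
            break;
        /* Hash one token (djb2-style; unsigned overflow wraps). */
        unsigned h = 5381u;
        while (*p && !isspace((unsigned char)*p)) {
            h = h * 33u + (unsigned char)*p;
            p++;
        }
        counts[h % NUM_FEATURES]++;
    }
}

/* Extract one feature vector per report. Each iteration writes only
 * to its own row features[i], so the loop needs no locking. */
void extract_all(const char *const *reports, size_t n,
                 unsigned (*features)[NUM_FEATURES])
{
    assert(n == 0 || (reports != NULL && features != NULL));

    /* Dynamic scheduling helps when report sizes vary a lot. */
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)n; i++)
        extract_features(reports[i], features[i]);
}
```

With GCC, the pragma takes effect when compiling with `-fopenmp`; without that flag it is ignored and the loop runs serially with the same result. The number of threads can then be set through the `OMP_NUM_THREADS` environment variable.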
The figure shows some preliminary run-time measurements for the extraction of feature vectors from malware reports. Multi-processing clearly accelerates feature extraction, regardless of the data format. For example, when reading from a directory, throughput doubles with two threads, reaching up to 100 files per second (note that this experiment was run on a dual-core machine). Surprisingly, extraction is also fast when using compressed archives: there is almost no difference between feature extraction from a zip/gz archive and from a plain directory. Moreover, the zip/gz archive takes up only 5% of the original space, considerably reducing the required storage. That's impressive. If you are loading and processing thousands of files, these tuning hacks might be an interesting option.