Machine Learning for Computer Security

Good and bad times with machine learning and security research

Fun and Pain with ZFS.

Recently, I found the time to experiment with ZFS (Sun's next-generation file system) using an external USB drive. In particular, I wanted to test whether ZFS eases typical tasks of machine learning research, such as loading directories containing tens of thousands of files or repeating experimental runs with large chunks of data. As I am not running a Solaris system, I had to install the userland utilities and kernel modules for OSX (Leopard) available at Macforge. Note that Leopard natively supports read-only access to ZFS pools but lacks write functionality.

Apart from great read access times, the first interesting feature I noticed is ZFS's ability to create hierarchical file systems on the fly. Instead of dumping all contents into a single volume, the hierarchical layout enables fine-grained control of different experimental data sets and allows quotas and options to be assigned individually per data set. For example, the ability to easily split data comes in handy if one enables the transparent compression in ZFS. The following commands create two file systems named tank/dataset1 and tank/dataset2, of which only the first uses automatic compression.
% zfs create tank/dataset1
% zfs create tank/dataset2
% zfs set compress=on tank/dataset1
% zfs set compress=off tank/dataset2
Compression might not be desired when running certain experiments. Thus, it can be enabled and disabled per file system, such that archived data is stored efficiently while the current workbench remains accessible without compression overhead, a nice feature for working with experimental data. The compression ratio of each file system can be queried using the following command.
% zfs get compressratio tank/dataset1 tank/dataset2
NAME           PROPERTY       VALUE  SOURCE
tank/dataset1  compressratio  3.14x  -
tank/dataset2  compressratio  1.00x  -
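The per-data-set quotas mentioned earlier follow the same pattern as the compression setting. The following is a minimal sketch, assuming the Leopard port supports the standard quota property; the quota sizes (50G and 100G) and the helpers run and set_quota are made up for illustration. By default the script only prints the commands it would execute, so it can be tried without a ZFS pool.

```shell
#!/bin/sh
# Sketch: assign an individual quota to each data set.
# The sizes below are illustrative assumptions, not values from this post.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%% %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: $1 = file system, $2 = quota size (e.g. 50G).
set_quota() {
    run zfs set quota="$2" "$1"
}

set_quota tank/dataset1 50G
set_quota tank/dataset2 100G

# Show the resulting quotas for both file systems.
run zfs get quota tank/dataset1 tank/dataset2
```

Running the script with DRY_RUN=0 executes the commands for real, which usually requires administrator privileges.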
Another interesting feature is ZFS's ability to store snapshots of file systems. Initially, a snapshot does not consume any disk space, as only the differences to the original version are stored. If one is working with multiple copies of the same data, say one version described in a publication and a refined variant, snapshots are a great tool, as they allow one to quickly jump back and forth between different versions of data. One can access the content of each snapshot using the directory .zfs in the root of the considered file system. Here is an example of how a snapshot is created for a given data set.
% zfs snapshot tank/dataset1@paper
As can be seen from a call to zfs list, no space is associated with the snapshot directly after creation, and thus no storage is wasted.
% zfs list
tank/dataset1        2.38G  222G  2.38G  /Volumes/tank/dataset1
tank/dataset1@paper      0     -  2.38G  -
This feature is really nifty if one needs to "freeze" the state of experimental data when a paper is accepted for publication while still continuing to work with the data set.
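Jumping back to the frozen state might look roughly like the sketch below. It assumes the mount point /Volumes/tank/dataset1 shown in the zfs list output and the standard .zfs/snapshot/<name> layout; the helpers run and snapshot_dir are made up for illustration, and I have not verified how well the Leopard port handles rollbacks. As written, the script only prints the commands.

```shell
#!/bin/sh
# Sketch: inspect a snapshot and roll the data set back to it.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

FS=tank/dataset1
MOUNTPOINT=/Volumes/tank/dataset1   # taken from the zfs list output
SNAP=paper

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%% %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: path of the read-only view of a snapshot.
# $1 = mount point of the file system, $2 = snapshot name.
snapshot_dir() {
    printf '%s/.zfs/snapshot/%s\n' "$1" "$2"
}

# Browse the data exactly as it was when the snapshot was taken.
run ls "$(snapshot_dir "$MOUNTPOINT" "$SNAP")"

# Discard all changes made since the snapshot. This is destructive and,
# without further options, only works for the most recent snapshot.
run zfs rollback "$FS@$SNAP"
```

For merely comparing versions, reading from the .zfs directory is the safer choice, since a rollback throws away the refined variant of the data.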

Unfortunately, all my enthusiasm and great plans to work with ZFS were dashed by the instability of the current Leopard driver. The driver does not handle USB devices reliably. If the device is accidentally removed or the computer goes to sleep, the ZFS module simply crashes the kernel. I even managed to corrupt the ZFS partition such that any call to zpool status triggers a kernel panic. Game over.