Machine Learning for Computer Security

Good and bad times with machine learning and security research

Some Tuning for Feature Extraction.

One key to efficient application of learning methods in practice is fast extraction of features from raw data. As an example, I am currently working on methods for automatic analysis of malware, where the behavior of a malware binary is represented in a textual report and mapped to a vector space using frequencies of contained substrings (see here). However, thousands of binaries need to be processed and often the generated report files are huge. Consequently, the task of designing analysis methods gets tedious, as one has to wait several minutes just to load data and extract appropriate feature strings.
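To make the mapping from a report to a vector concrete, here is a minimal sketch. It assumes that the "contained substrings" are all substrings of a fixed length n (n-grams) and that the frequencies are scaled to unit Euclidean length. The function names, the choice of n and the sample report text are made up for illustration and are not taken from the actual analysis code.

```python
from collections import Counter
import math


def ngram_counts(report: str, n: int = 3) -> Counter:
    """Count every substring of length n contained in the report text."""
    return Counter(report[i:i + n] for i in range(len(report) - n + 1))


def normalize(counts: Counter) -> dict:
    """Scale the counts to unit Euclidean length (an assumed normalization)."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


# Made-up report text, standing in for a real behavior report.
report = "open file; write file; open socket"
vec = normalize(ngram_counts(report, n=4))
```

The vector is stored sparsely as a dictionary, since only a tiny fraction of all possible substrings occurs in any single report.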

In search of a remedy, I have experimented with OpenMP (Open Multi-Processing) and libarchive. The former is a simple API for shared-memory parallel programming in C, and the latter is a library for reading and writing file archives such as zip, tar and so on. OpenMP enables loading of data and extraction of features in parallel, while libarchive allows the data to be stored efficiently in compressed archives instead of directories.
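The overall structure of this approach can be sketched in a few lines. Note that this is not the C implementation: as a stand-in it uses Python's standard zipfile module in place of libarchive and a thread pool in place of OpenMP. All function names and the demo archive are hypothetical. CPython threads do not run Python bytecode in parallel, so the sketch illustrates the structure rather than the speedup.

```python
import io
import zipfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def extract_features(text: str, n: int = 3) -> Counter:
    """Count all substrings of length n (assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def load_and_extract(archive_bytes: bytes, name: str) -> Counter:
    """Read one report from the archive and extract its features."""
    # Each worker opens its own handle, so no file position is shared.
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        text = zf.read(name).decode("utf-8", errors="replace")
    return extract_features(text)


def extract_all(archive_bytes: bytes, workers: int = 2) -> dict:
    """Process all reports in the archive with a pool of worker threads."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda n: load_and_extract(archive_bytes, n), names)
        return dict(zip(names, results))


# Build a small demo archive in memory with made-up report contents.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("report1.txt", "open file; write file")
    zf.writestr("report2.txt", "open socket; send data")

features = extract_all(buf.getvalue(), workers=2)
```

In the C version, the loop over archive entries would be the natural place for the OpenMP parallelization.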





The figure shows some preliminary run-time measurements for extraction of feature vectors from malware reports. The application of multi-processing clearly accelerates the feature extraction, independent of the applied data format. For example, when reading from a directory, the performance doubles if two threads are used, which enables processing of up to 100 files per second (note that this experiment was run on a dual-core machine). Surprisingly, the extraction performance is also high when using compressed archives. There is almost no difference between feature extraction from a zip/gz archive and from a plain directory. Moreover, the zip/gz archive consumes only 5% of the original space and thus considerably reduces the amount of required storage. That's impressive. If you are dealing with loading and processing thousands of files, these tuning hacks might be an interesting option.

It's finally done.

After a long period of hard and boring work, I finally submitted my Ph.D. thesis to the computer science faculty at the Berlin Institute of Technology (TU Berlin). The thesis is entitled "Machine Learning for Application-Layer Intrusion Detection" and is refereed by Klaus-Robert Müller, John McHugh and Pavel Laskov.

In my thesis, I tackle the problem of detecting unknown and novel attacks in the application layer of network communication and present a machine learning framework for intrusion detection. In particular, I propose a generic technique for embedding of network payloads in vector spaces such that features extracted from the payloads are accessible to statistical and geometric analysis. Efficient learning in these high-dimensional vector spaces is realized using the concept of kernel functions defined over network payload data. Based on these functions, I derive methods for anomaly detection suitable for identification of unknown attacks, where normality of network data is modeled using geometric concepts such as hyperspheres and neighborhoods.
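To give a rough idea of how these pieces fit together, here is a minimal sketch, not the implementation from the thesis. It assumes byte 2-grams as features and a normalized linear kernel, and it uses the simplest hypersphere model: the anomaly score of a payload is its squared distance to the center of mass of the training payloads, computed with kernel evaluations only. The function names and the payloads are made up for illustration.

```python
from collections import Counter
import math


def ngrams(payload: bytes, n: int = 2) -> Counter:
    """Embed a payload by counting its byte n-grams."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))


def kernel(x: Counter, y: Counter) -> float:
    """Normalized linear kernel (cosine similarity) over n-gram counts."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the smaller of the two vectors
    dot = sum(c * y[g] for g, c in x.items())
    norm = math.sqrt(sum(c * c for c in x.values()) *
                     sum(c * c for c in y.values()))
    return dot / norm if norm else 0.0


def center_distance(z: Counter, train: list) -> float:
    """Squared distance of z to the center of mass of the training data:
    k(z,z) - 2/m * sum_i k(z,x_i) + 1/m^2 * sum_ij k(x_i,x_j)."""
    m = len(train)
    cross = sum(kernel(z, x) for x in train) / m
    within = sum(kernel(a, b) for a in train for b in train) / (m * m)
    return kernel(z, z) - 2.0 * cross + within


# Made-up payloads standing in for normal traffic.
train = [ngrams(p) for p in (
    b"GET /index.html HTTP/1.1",
    b"GET /about.html HTTP/1.1",
    b"GET /news.html HTTP/1.1",
)]

normal_score = center_distance(ngrams(b"GET /contact.html HTTP/1.1"), train)
odd_score = center_distance(
    ngrams(b"GET /scripts/..%255c../cmd.exe?/c+dir HTTP/1.0"), train)
```

A payload is flagged as anomalous if its score exceeds a threshold, which corresponds to lying outside a hypersphere around the center. A neighborhood-based variant would replace the center by the nearest training payloads.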

The framework is empirically evaluated using 10 days of HTTP and FTP network traffic and over 100 real attacks unknown to the applied learning methods. A prototype of the framework outperforms related methods from Kruegel et al. (2002), Wang et al. (2006) and Ingham et al. (2007): it identifies 80–97% of the unknown attacks with less than 0.002% false positives. Moreover, reasonable throughput rates of 20–60 Mbit/s are attained, although no special hardware acceleration is utilized yet.

As the thesis is under review, I will not provide an online version now. However, I am going to present some interesting results on visualization of detected attack payloads in later posts. Stay tuned.