Machine Learning for Computer Security

Good and bad times with machine learning and security research

Efficient Machine Learning with Trees.

Trees are a natural and intuitive representation of data in many domains of computer science. Although ubiquitous, analysis of tree data is far from trivial. Most techniques of machine learning are confined to operate in vector spaces and lack the ability to process trees. A solution to this problem is provided by tree kernels which allow for application of kernel-based learning methods to tree data. Unfortunately, the run-time complexity of tree kernels is inherently quadratic in the number of nodes. While small trees can be processed with minor overhead, learning with larger trees gets intractable. For example, the computation of a regular kernel for two parse trees of HTML documents comprising 10,000 nodes each, requires about 1 Gigabyte of memory and takes over 100 seconds on a recent computer system. Given that kernel computations are performed millions of times in large-scale learning, it is evident that regular tree kernels are an inappropriate choice in many learning tasks.

As a result, we have often abstained from using tree data in our research. However, we recently found the time to address this problem and devised approximate tree kernels. Instead of fiddling with algorithmic issues and implementations, we propose to approximate the computation of kernel functions for trees. To this end, we narrow the kernel computation to a sparse subset of subtrees rooted at relevant symbols and thereby avoid considering all possible subtrees as in regular tree kernels. This sparse subset is automatically selected with respect to a given learning task, such that the expressiveness of the approximate kernel is preserved while its run-time is reduced. Though simple in design, this approximation enables speed-ups up to three orders of magnitude and, for the first time, allows to compare HTML documents for detection of Web spam efficiently.

The concept of approximate tree kernels as well as several applications are described in a recent article published in the Journal of Machine Learning Research. The abstract for the article is given below:
Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space.
Approximate Tree Kernels. Konrad Rieck, Tammo Krueger, Ulf Brefeld and Klaus-Robert Müller. Journal of Machine Learning Research (JMLR), 11(Feb):555−580, Microtome, 2010.

Call for Papers: EC2ND 2010.

As program chair, I am happy to announce the sixth European Conference on Computer Network Defense (EC2ND 2010) in Berlin, Germany. The conference brings together researchers from academia and industry within Europe and beyond to present and discuss current topics in applied network and systems security.

EC2ND 2010 invites submissions presenting novel ideas in the areas of network defense, intrusion detection and systems security. Topics for submission include but are not limited to:
  • Intrusion Detection
  • Malicious Software
  • Web Security
  • Security Policy
  • Peer-to-Peer and Grid Security
  • Wireless and Mobile Security
  • Network Forensics
  • Network Discovery and Mapping
  • Incident Response and Management
  • Privacy Protection
  • Cryptography
  • Legal and Ethical Issues
A detailed call for papers is available here. Note the following important dates:
  • Paper submission deadline: July 2, 2010
  • Paper acceptance or rejection: August 6, 2010
  • Final paper camera ready copy: August 13, 2010
  • Conference dates: October 28–29, 2010
I am looking forward to interesting contributions and an awesome conference.

Malheur is out!

After almost a year of work, I am proud to announce the first public release of Malheur—a tool for automatic analysis of program behavior recorded from malicious software (www.mlsec.org/malheur).

Malicious software (malware) is one of the major threats in the Internet today and millions of hosts are currently infected with malware programs, such as computer worms, backdoors and trojan horses. The sheer amount of malware renders manual analysis of malicious files impossible and, even worse, automatic inspection of file content is strongly obstructed by obfuscation techniques. An alternative for efficiently crafting new defenses against malware is behavior-based analysis: Malware programs are collected in the wild and executed in a sandbox environment, where their behavior is monitored. The execution of each binary results in a report of monitored behavior which can be used to characterize and ultimately defend against malicious software.

As the first publicly available tool, Malheur analyzes program behavior of malicious software and enables automatic discovery and discrimination of novel variants. This ability is rooted in well-known concepts of machine learning. Discovery of novel malware classes resembles a clustering problem and discrimination between classes matches a classification task. Both techniques are implemented in Malheur with effectivity and efficiency in mind. Instead of fiddling with esoteric learning concepts, Malheur resorts to basic methods, such as linkage clustering and nearest-prototype classification, which are efficiently implemented by means of parallel programming. Expressive access to recorded behavior is realized by embedding reports in a vector space spanned by short behavioral patterns, similar in spirit to bag-of-words models.

Malheur is a joint effort of Berlin Institute of Technology and University of Mannheim. The merits of Malheur along with an empirical evaluation using real malware are detailed in a technical report:

Automatic Analysis of Malware Behavior using Machine Learning. Konrad Rieck, Philipp Trinius, Carsten Willems and Thorsten Holz. Technical Report 18-209, Berlin Institute of Technology, 2009.

My Thesis as a Paperback.

It may sound a little odd, but when I submitted my thesis to the library of my university, I was missing something. Back then, all I had to do was to send in some PDF file and fill out some form. Had the outcome of several months of hard work been yet another PDF document?

Recently, I started to alleviate this strange pain by publishing my thesis as a print-on-demand book at lulu.com. The process is pretty straightforward and, given that most of us work with latex, took only a couple of hours to adapt the sources. Needless to say, that I felt much better, once I held the first paperback version of the thesis in my hands. And, if you are truly interested in reading this work, you can also grab a paperback copy for only 9.99€.

Of course, if you aim for true scientific glory, it might be more appropriate to hunt for a renowned publisher, convince the editors, struggle with presentation and content requirements and finally receive a real book at about 99€.

Visualization of Payload-based Anomaly Detection.

Two paradigms for detection of attacks against computer systems have been widely studied in security research: First, misuse detection which aims at identifying known patterns of misuse and, second, anomaly detection which tries to detect deviations from normal usage. While both concepts have their individual pros and cons, only one of the two paradigms, namely misuse detection, has made its way into regular security products. For example, almost all anti-virus scanners and intrusion detection systems employ a database of known misuse patterns for spotting security problems. Although successful on the market, misuse detection inherently fails to protect from novel threats, such as zero-day exploits. In turn, anomaly detection methods provide means to identify unknown threats with high precision and low run-time overhead (see for instance work by Cretu, Ingham, Perdisci or myself). So, why is there still a lack of acceptance for anomaly detection in practice?

One key problems with most anomaly detection approaches is their inability to explain decisions. The detection process resembles a black-box and security operators are required to thoroughly analyze context information to assess the actual cause of a reported anomaly. We have addressed this problem in a recent paper published at the European Conference on Computer and Network Defense (EC2ND). In particular, we present visualization techniques suitable for explaining the decisions of payload-based anomaly detection systems. Instead of digging into the technical details, I herein present an example for explaining the detection of a real network attack. An in-depth discussion is provided in our paper.




The above figures shows so called feature differences of a command injection attack (awstats configdir exploit), where peaks indicate string features that strongly deviate from a model of normal network traffic. The attack exploits an insecure handling of input parameters to pass shell commands to an HTTP server. The transferred commands are mapped to the standard URI scheme, which replaces reserved characters by the symbol “%” and an hexadecimal value. For example, “%20” denotes a space symbol, “%3b” a semi-colon, “%26” an ampersand and “%27” an apostrophe. In particular, the semi-colon and ampersand are characteristic for shell commands as they reflect specific semantics of shell syntax .In the presented visualization exactly these patterns are represented as high peaks. While similar string patterns can be also observed in legitimate traffic, the high differences in the figure clearly indicate an anomalous activity involving escaped shell commands and explain the reason for reporting of an anomaly.




Another visualization technique, feature shading, is presented in the
second figure, where the payload of a reported anomaly is superimposed with color reflecting the individual deviation of each byte from a model of normal network traffic. The URI of the command injection attack is flagged as anomalous by dark shading, thus indicating the presence of abnormal strings. The part ensuing the URI, however, is indicated as normal region, as it mainly contains frequent HTTP patterns, such as “Mozilla” and “Googlebot”. This example demonstrates the ability of a shading to emphasize anomalous contents in network payloads, while also indicating benign regions and patterns.

By visualizing a “colorful” network payload a security operator is able to quickly identify relevant and malicious content in data, eventually enabling effective countermeasures. Consequently, the decisions made by a payload-based detection system – so far opaque to a security operator – can now be visually explained, such that one can benefit from early detection of novel attacks as well as an explainable detection process.

Visualization and Explanation of Payload-Based Anomaly Detection. Konrad Rieck and Pavel Laskov. Proceedings of 5th European Conference on Computer and Network Defense (EC2ND).

Detecting the "Phoning Home" of Malicious Software.

Malicious software poses a severe threat to security of computer systems. Whether you download a file, plug in the USB stick of a colleague or simply surf the Web, your computer is always at risk to be compromised by malicious software, such as computer worms, backdoors or Trojan horses. Once the evil has infiltrated your system, it usually initiates a process referred to as "phoning home": The malicious software contacts its author and hands over control of your computer to him, for example for sending spam messages or conducting a distributed flooding attack. Unfortunately, regular security tools, such as anti-virus scanners, increasingly fail to protect the many infection vectors of malicious software and thus users are often left alone with systems "phoning home" to bad people.

In our latest research (to be published at the 25th ACM Symposium on Applied Computing) we address this problem and introduce Botzilla, a method for automatically detecting the "phoning home" of malicious software. Botzilla operates by first collecting malicious software in the wild using honeypots. The malicious software is then repetitively executed in a controlled environment and its communication is recorded— similar to a rat in a lab. Invariants communication patterns, such as byte strings used for handshaking and remote control, are extracted and assembled to detection signatures using a naive-Bayes classification scheme. As a result, Botzilla is able to automatically generate signatures for malicious software within minutes and allows to counteract the propagation of evil in the first round. An abstract for this work is here:
Hosts infected with malicious software, so called malware, are ubiquitous in today's computer networks. The means whereby malware can infiltrate a network are manifold and range from exploiting of software vulnerabilities to tricking a user into executing malicious code. Monitoring and detection of all possible infection vectors is intractable in practice. Hence, we approach the problem of detecting malicious software at a later point when it initiates contact with its maintainer; a process referred to as "phoning home". In particular, we introduce Botzilla, a method for detection of malware communication, which proceeds by repetitively recording network traffic of malware in a controlled environment and generating network signatures from invariant content patterns. Experiments conducted at a large university network demonstrate the ability of Botzilla to accurately identify malware communication in network traffic with very low false-positive rates.

Botzilla: Detecting the "Phoning Home" of Malicious Software. Konrad Rieck, Guido Schwenk, Tobias Limmer, Thorsten Holz and Pavel Laskov. Proceedings of 25th ACM Symposium on Applied Computing (SAC).

The work on Botzilla is a small yet successful effort of Berlin Institute of Technology, Fraunhofer FIRST, University of Erlangen, Technical University Vienna and University of Tübingen.

Active Learning for Network Intrusion Detection.

Two paradigms for application of machine learning to security are prevalent: First, supervised learning as employed in spam filtering and, second, unsupervised learning as applied for network anomaly detection. Supervised learning methods construct models for data using label information attached to each data object. For example, email messages tagged as spam and non-spam are used to learn a discriminative model for spam filtering. In contrast, unsupervised learning methods operate on given data only and do not make use label information. For example, models for detection anomalous network payloads are usually learned from unlabeled network traffic without the need to label million of network payloads.

Besides these two paradigms, however, learning theory also provides the hybrid concept of semi-supervised learning. Although technically more involved, semi-supervised methods combine the best of the classic learning paradigms: The majority of training data can be unlabeled, whereas only few instances need to be equipped with label information for learning. Semi-supervised methods are more accurate than unsupervised methods while not suffering from the problem of labeling large amounts of data. Unfortunately, the security community has largely ignored semi-supervised learning and, consequently, often argues for either one of the two classic learning paradigms.

In a first study, we have applied the concept of semi-supervised learning to network security and devised an learning method for network intrusion detection. Our method initially operates on unlabeled data as most of previous learning approaches to intrusion detection. However, the devised method then specifically requests label information for certain network events to improve the learning model. For example, our method requests labels for points that lie close to the decision boundary and thus may sharpen the detection accuracy. With only a minimal labeling effort a security operator can tune our method to particular network data and eliminate false-positive alarms that come with the majority of regular detection approaches. A paper of this work has been recently
accepted at AISEC 2009. Here is its abstract:
Anomaly detection for network intrusion detection is usually considered an unsupervised task. Prominent techniques, such as one-class support vector machines, learn a hypersphere enclosing network data, mapped to a vector space, such that points outside of the ball are considered anomalous. However, this setup ignores relevant information such as expert and background knowledge. In this paper, we rephrase anomaly detection as an active learning task. We propose an effective active learning strategy to query low-confidence observations and to expand the data basis with minimal labeling effort. Our empirical evaluation on network intrusion detection shows that our approach consistently outperforms existing methods in relevant scenarios.

Active Learning for Network Intrusion Detection. Nico Görnitz, Marius Kloft, Konrad Rieck and Ulf Brefeld. Proceedings of CCS Workshop on Security and Artificial Intelligence (AISEC), October 2009.