Machine Learning for Computer Security

Good and bad times with machine learning and security research

Second Call for Papers: EC2ND 2010.

Only three weeks to go: The sixth European Conference on Computer Network Defense (EC2ND) invites submissions presenting novel ideas in the areas of network defense, intrusion detection and systems security. We specifically encourage submissions presenting work at an early stage with the intention to act as a discussion forum for innovative security research.

Paper submission deadline: July 2, 2010
Paper acceptance or rejection: August 6, 2010
Conference dates: October 28-29, 2010

More information on EC2ND 2010 and paper submission are available at the conference website. Additionally, you can also follow on twitter: ec2nd.

TokDoc: The Token Doctor.

Detecting and preventing network intrusions is a basic task in computer security. Thus, it no surprise that there exists several security products capable to block network attacks, though with moderate success and restricted to a database of known attack patterns. For a security researcher two issues with these systems are not satisfactory. First, current intrusion prevention systems rely on a up-to-date database of attack signatures. Novel and unknown attacks will likely go undetected. Second, network traffic is either passed or completely blocked—often with fatal consequences and cut-off of benign communication.

In recent years, security research has focused on addressing the first issue by extending intrusion detection systems with techniques of statistics and machine learning. The resulting systems proof effective in many scenarios. Still, they build on a binary pass-or-block decision on every analyzed event. What if we question this rigid decision making? Together with Tammo Krueger, Christian Gehl and Pavel Laskov, we have dared to go one step beyond. In a recent paper, we propose a web-application firewall that is not only capable to detect network attacks, but provides means to "heal" abnormal content to some degree. We call it the Token Doctor.

The Token Doctor, or short TokDoc, acts as a reverse proxy and inspects incoming requests of HTTP traffic. Each requests is parsed into its individual tokens, such as the URL, parameters, headers and so on. Each tokens is then analyzed individually using anomaly detection techniques. If an anomaly is spotted in a token, a fine-grained process is launched. Some tokens, such as usernames and cookie values, can not be corrected, hence there are simply dropped from the request as with usual prevention systems. Other tokens, however, are "healed" by replacing them with benign content. This replacement is automatically selected by determining similar tokens in a pool of benign HTTP requests.

The "healing" implemented in TokDok is not guaranteed to succeed. Frankly speaking, automatic amending of network data is a controversial idea and one can image a lot of things going wrong. Nevertheless, our experiments demonstrate the opposite. TokDoc provides an excellent detection accuracy while significantly reducing false positives in comparison to state-of-the-art methods. For example, several benign requests, which would been dropped by a regular systems due to minor irregularities, are slightly amended without effects on functionality. Overall, there is room for discussion: Either stick to a rigid decision with painful blocking or choose a fuzzy amending with lots of promises but no guarantees. I don't know.

A corresponding paper has been published at the 24th ACM Symposium on Applied Computing (SAC) this year. Its abstract is here:
The growing amount of web-based attacks poses a severe threat to the security of web applications. Signature-based detection techniques increasingly fail to cope with the variety and complexity of novel attack instances. As a remedy, we introduce a protocol-aware reverse HTTP proxy TokDoc (the token doctor), which intercepts requests and decides on a per-token basis whether a token requires automatic "healing". In particular, we propose an intelligent mangling technique, which, based on the decision of previously trained anomaly detectors, replaces suspicious parts in requests by benign data the system has seen in the past. Evaluation of our system in terms of accuracy is performed on two real-world data sets and a large variety of recent attacks. In comparison to state-of-the-art anomaly detectors, TokDoc is not only capable of detecting most attacks, but also significantly outperforms the other methods in terms of false positives. Runtime measurements show that our implementation can be deployed as an inline intrusion prevention system.
TokDoc: A Self-Healing Web Application Firewall. Tammo Krueger, Christian Gehl, Konrad Rieck and Pavel Laskov. Proc. of 25th ACM Symposium on Applied Computing (SAC), 1846-1853, March 2010.

Call for Papers: SICHERHEIT 2010

I am happy to serve on the program committee of the conference "Sicherheit, Schutz und Zuverlässigkeit" (SICHERHEIT) which will take place in October 2010 in Berlin. SICHERHEIT is the German forum for experts from academia and industry to discuss aspects of safety (protection from catastrophic events of technical systems) and security (protection of confidentiality and integrity of information in technical systems).

The organizers seek submissions from the broad range of safety and security, for example on the following topics:
  • biometrics, privacy, data protection
  • e-commerce, e-government
  • certification of safe and secure systems
  • cryptography, digital signatures, steganography
  • development and maintenance of safe and secure systems
  • formal methods for safe and secure systems
  • management of safe and secure systems
  • management of information security
  • network security
  • reactive security
  • reliability and availability
The conference language is English and SICHERHEIT welcomes participants and submissions from non-German-speaking countries. More information can be found at www.sicherheit2010.de. Deadline for paper submissions is March, 29th. So hurry up!

Efficient Machine Learning with Trees.

Trees are a natural and intuitive representation of data in many domains of computer science. Although ubiquitous, analysis of tree data is far from trivial. Most techniques of machine learning are confined to operate in vector spaces and lack the ability to process trees. A solution to this problem is provided by tree kernels which allow for application of kernel-based learning methods to tree data. Unfortunately, the run-time complexity of tree kernels is inherently quadratic in the number of nodes. While small trees can be processed with minor overhead, learning with larger trees gets intractable. For example, the computation of a regular kernel for two parse trees of HTML documents comprising 10,000 nodes each, requires about 1 Gigabyte of memory and takes over 100 seconds on a recent computer system. Given that kernel computations are performed millions of times in large-scale learning, it is evident that regular tree kernels are an inappropriate choice in many learning tasks.

As a result, we have often abstained from using tree data in our research. However, we recently found the time to address this problem and devised approximate tree kernels. Instead of fiddling with algorithmic issues and implementations, we propose to approximate the computation of kernel functions for trees. To this end, we narrow the kernel computation to a sparse subset of subtrees rooted at relevant symbols and thereby avoid considering all possible subtrees as in regular tree kernels. This sparse subset is automatically selected with respect to a given learning task, such that the expressiveness of the approximate kernel is preserved while its run-time is reduced. Though simple in design, this approximation enables speed-ups up to three orders of magnitude and, for the first time, allows to compare HTML documents for detection of Web spam efficiently.

The concept of approximate tree kernels as well as several applications are described in a recent article published in the Journal of Machine Learning Research. The abstract for the article is given below:
Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space.
Approximate Tree Kernels. Konrad Rieck, Tammo Krueger, Ulf Brefeld and Klaus-Robert Müller. Journal of Machine Learning Research (JMLR), 11(Feb):555−580, Microtome, 2010.

Call for Papers: EC2ND 2010.

As program chair, I am happy to announce the sixth European Conference on Computer Network Defense (EC2ND 2010) in Berlin, Germany. The conference brings together researchers from academia and industry within Europe and beyond to present and discuss current topics in applied network and systems security.

EC2ND 2010 invites submissions presenting novel ideas in the areas of network defense, intrusion detection and systems security. Topics for submission include but are not limited to:
  • Intrusion Detection
  • Malicious Software
  • Web Security
  • Security Policy
  • Peer-to-Peer and Grid Security
  • Wireless and Mobile Security
  • Network Forensics
  • Network Discovery and Mapping
  • Incident Response and Management
  • Privacy Protection
  • Cryptography
  • Legal and Ethical Issues
A detailed call for papers is available here. Note the following important dates:
  • Paper submission deadline: July 2, 2010
  • Paper acceptance or rejection: August 6, 2010
  • Final paper camera ready copy: August 13, 2010
  • Conference dates: October 28–29, 2010
I am looking forward to interesting contributions and an awesome conference.

Malheur is out!

After almost a year of work, I am proud to announce the first public release of Malheur—a tool for automatic analysis of program behavior recorded from malicious software (www.mlsec.org/malheur).

Malicious software (malware) is one of the major threats in the Internet today and millions of hosts are currently infected with malware programs, such as computer worms, backdoors and trojan horses. The sheer amount of malware renders manual analysis of malicious files impossible and, even worse, automatic inspection of file content is strongly obstructed by obfuscation techniques. An alternative for efficiently crafting new defenses against malware is behavior-based analysis: Malware programs are collected in the wild and executed in a sandbox environment, where their behavior is monitored. The execution of each binary results in a report of monitored behavior which can be used to characterize and ultimately defend against malicious software.

As the first publicly available tool, Malheur analyzes program behavior of malicious software and enables automatic discovery and discrimination of novel variants. This ability is rooted in well-known concepts of machine learning. Discovery of novel malware classes resembles a clustering problem and discrimination between classes matches a classification task. Both techniques are implemented in Malheur with effectivity and efficiency in mind. Instead of fiddling with esoteric learning concepts, Malheur resorts to basic methods, such as linkage clustering and nearest-prototype classification, which are efficiently implemented by means of parallel programming. Expressive access to recorded behavior is realized by embedding reports in a vector space spanned by short behavioral patterns, similar in spirit to bag-of-words models.

Malheur is a joint effort of Berlin Institute of Technology and University of Mannheim. The merits of Malheur along with an empirical evaluation using real malware are detailed in a technical report:

Automatic Analysis of Malware Behavior using Machine Learning. Konrad Rieck, Philipp Trinius, Carsten Willems and Thorsten Holz. Technical Report 18-209, Berlin Institute of Technology, 2009.

My Thesis as a Paperback.

It may sound a little odd, but when I submitted my thesis to the library of my university, I was missing something. Back then, all I had to do was to send in some PDF file and fill out some form. Had the outcome of several months of hard work been yet another PDF document?

Recently, I started to alleviate this strange pain by publishing my thesis as a print-on-demand book at lulu.com. The process is pretty straightforward and, given that most of us work with latex, took only a couple of hours to adapt the sources. Needless to say, that I felt much better, once I held the first paperback version of the thesis in my hands. And, if you are truly interested in reading this work, you can also grab a paperback copy for only 9.99€.

Of course, if you aim for true scientific glory, it might be more appropriate to hunt for a renowned publisher, convince the editors, struggle with presentation and content requirements and finally receive a real book at about 99€.