Machine Learning for Computer Security

Good and bad times with machine learning and security research

Call for Papers: SICHERHEIT 2010

I am happy to serve on the program committee of the conference "Sicherheit, Schutz und Zuverlässigkeit" (SICHERHEIT) which will take place in October 2010 in Berlin. SICHERHEIT is the German forum for experts from academia and industry to discuss aspects of safety (protection from catastrophic events of technical systems) and security (protection of confidentiality and integrity of information in technical systems).

The organizers seek submissions from the broad range of safety and security, for example on the following topics:
  • biometrics, privacy, data protection
  • e-commerce, e-government
  • certification of safe and secure systems
  • cryptography, digital signatures, steganography
  • development and maintenance of safe and secure systems
  • formal methods for safe and secure systems
  • management of safe and secure systems
  • management of information security
  • network security
  • reactive security
  • reliability and availability
The conference language is English and SICHERHEIT welcomes participants and submissions from non-German-speaking countries. More information can be found at www.sicherheit2010.de. Deadline for paper submissions is March, 29th. So hurry up!

Efficient Machine Learning with Trees.

Trees are a natural and intuitive representation of data in many domains of computer science. Although ubiquitous, analysis of tree data is far from trivial. Most techniques of machine learning are confined to operate in vector spaces and lack the ability to process trees. A solution to this problem is provided by tree kernels which allow for application of kernel-based learning methods to tree data. Unfortunately, the run-time complexity of tree kernels is inherently quadratic in the number of nodes. While small trees can be processed with minor overhead, learning with larger trees gets intractable. For example, the computation of a regular kernel for two parse trees of HTML documents comprising 10,000 nodes each, requires about 1 Gigabyte of memory and takes over 100 seconds on a recent computer system. Given that kernel computations are performed millions of times in large-scale learning, it is evident that regular tree kernels are an inappropriate choice in many learning tasks.

As a result, we have often abstained from using tree data in our research. However, we recently found the time to address this problem and devised approximate tree kernels. Instead of fiddling with algorithmic issues and implementations, we propose to approximate the computation of kernel functions for trees. To this end, we narrow the kernel computation to a sparse subset of subtrees rooted at relevant symbols and thereby avoid considering all possible subtrees as in regular tree kernels. This sparse subset is automatically selected with respect to a given learning task, such that the expressiveness of the approximate kernel is preserved while its run-time is reduced. Though simple in design, this approximation enables speed-ups up to three orders of magnitude and, for the first time, allows to compare HTML documents for detection of Web spam efficiently.

The concept of approximate tree kernels as well as several applications are described in a recent article published in the Journal of Machine Learning Research. The abstract for the article is given below:
Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space.
Approximate Tree Kernels. Konrad Rieck, Tammo Krueger, Ulf Brefeld and Klaus-Robert Müller. Journal of Machine Learning Research (JMLR), 11(Feb):555−580, Microtome, 2010.