Machine Learning for Computer Security

Good and bad times with machine learning and security research

Efficient Machine Learning with Trees.

Trees are a natural and intuitive representation of data in many domains of computer science. Although ubiquitous, analysis of tree data is far from trivial. Most techniques of machine learning are confined to operate in vector spaces and lack the ability to process trees. A solution to this problem is provided by tree kernels which allow for application of kernel-based learning methods to tree data. Unfortunately, the run-time complexity of tree kernels is inherently quadratic in the number of nodes. While small trees can be processed with minor overhead, learning with larger trees gets intractable. For example, the computation of a regular kernel for two parse trees of HTML documents comprising 10,000 nodes each, requires about 1 Gigabyte of memory and takes over 100 seconds on a recent computer system. Given that kernel computations are performed millions of times in large-scale learning, it is evident that regular tree kernels are an inappropriate choice in many learning tasks.

As a result, we have often abstained from using tree data in our research. However, we recently found the time to address this problem and devised approximate tree kernels. Instead of fiddling with algorithmic issues and implementations, we propose to approximate the computation of kernel functions for trees. To this end, we narrow the kernel computation to a sparse subset of subtrees rooted at relevant symbols and thereby avoid considering all possible subtrees as in regular tree kernels. This sparse subset is automatically selected with respect to a given learning task, such that the expressiveness of the approximate kernel is preserved while its run-time is reduced. Though simple in design, this approximation enables speed-ups up to three orders of magnitude and, for the first time, allows to compare HTML documents for detection of Web spam efficiently.

The concept of approximate tree kernels as well as several applications are described in a recent article published in the Journal of Machine Learning Research. The abstract for the article is given below:
Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space.
Approximate Tree Kernels. Konrad Rieck, Tammo Krueger, Ulf Brefeld and Klaus-Robert Müller. Journal of Machine Learning Research (JMLR), 11(Feb):555−580, Microtome, 2010.

0 comments: