The majority of machine learning and artificial intelligence methods are designed to work on vectorial data. In practice, however, data does not always come in the form of simple vectors. For instance, analysis of HTML documents requires processing and matching tree structures. The state-of-the-art techniques for learning with trees are convolutional
kernel functions. These functions are effective but painfully slow on large trees. To speed-up such kernels for security applications, we came up with
approximate tree kernels. Here is the abstract from the corresponding technical report:
Convolution kernels for trees provide effective means for learning with tree-structured data, such as parse trees of natural language sentences. Unfortunately, the computation time of tree kernels is quadratic in the size of the trees as all pairs of nodes need to be compared: large trees render convolution kernels inapplicable. In this paper, we propose a simple but efficient approximation technique for tree kernels. The approximate tree kernel (ATK) accelerates computation by selecting a sparse and discriminative subset of subtrees using a linear program. The kernel allows for incorporating domain knowledge and controlling the overall computation time through additional constraints. Experiments on applications of natural language processing and web spam detection demonstrate the efficiency of the approximate kernels. We observe run-time improvements of two orders of magnitude while preserving the discriminative expressiveness and classification rates of regular convolution kernels.
Approximate Kernels for Trees. Konrad Rieck, Ulf Brefeld and Tammo Krueger.
Technical Report FIRST 5/2008, Fraunhofer Institute FIRST, September 2008.
I have been recently running experiments with a "fast" learning method for anomaly detection in HTTP and FTP payloads. While the method performed well in terms of attack detection, the realized throughput was frustrating. Incoming traffic was processed at 15 mbit/s (HTTP) and 30 mbit/s (FTP) on a single CPU, including traffic normalization, TCP reassembly and anomaly detection.
How useful is such a low throughput?
Well, better than one might think. Depending on the considered application-layer protocol, there is often an imbalance of ingress and egress traffic. For example, the HTTP traffic at our institute has an ingress-egress ratio of 1/50, while the FTP traffic recorded at LBNL (port 21 only) reaches a ratio of 1/6. Thus, when looking at the total throughput the considered learning method yields around 770 mbit/s (HTTP) and 200 mbit/s (FTP) throughput. That's not so frustrating, given that the performance could be easily increased using multi-core systems.
For other protocols, however, the ingress-egress is not a small fraction and one can not indirectly benefit from the imbalance of traffic. In conclusion, when analyzing the run-time performance of an intrusion detection method it really matters which protocol and direction is monitored for evil things.