- Generic clustering is NP-hard. Formally, clustering aims at partitioning data such that a predefined quality criterion is maximized. Unfortunately, there is no generic way for determining an optimal solution in this setting except for testing all possible combinations. All practical clustering methods, such as k-means and linkage clustering, limit the search space by discarding partitionings and thus provide only an approximation to the best clustering (a local optimum). If you cluster data, you will likely obtain an approximate solution.
- Hierarchy is no clustering. Linkage clustering is a form of hierarchical clustering, where the data is first mapped to a hierarchical representation and then partitioned into clusters. In practice, people often consider the hierarchical representation of data as the final clustering result. However, a hierarchy encodes an exponential number of possible partitionings and the more involved step is to flatten the hierarchy into a clustering. As a negative example consider the Wikipedia entry on hierarchical clustering which almost solely focuses on constructing hierarchies—omitting details on the subsequent clustering procedure.
- Statistical consistency unclear. An important term from statistical learning theory is consistency. Shortly put a learning algorithm is consistent if it converges to the true solution the more training data is provided. Surprisingly, little is known about the consistency of clustering. For instance, most clustering methods aim at partitioning provided data but not the underlying data distributions. That is, if you provide more data to a clustering, there is no guarantee that the solution improves (see the work of Luxburg [1, 2]).
- Model selection vs. Clustering. Clustering is often controlled using a model parameter that determines the granularity of the partitioning. This parameter can be the number of clusters as for k-means but may also take the form of a numeric quantity as in linkage clustering. Clearly, this parameter needs to be adapted prior to application. However, model selection contradicts with the purpose of clustering. On the one hand, to validate the parameter on given data a reference partitioning needs to be available, thus no clustering is required in this case. On the other hand on unlabeled data the parameter can never be validated directly. In practice, one resorts to model selection on reference data and application of the best model to unknown data. Consequently, this setting requires a representative reference partitioning obtained without clustering.
Data clustering is a popular technique of data mining and machine learning. The objective is to automatically partition a set of given objects into "meaningful" groups. Clustering is intuitive and fits a variety of problem settings, for example intrusion detection and malware categorization in computer security. However, application of clustering methods is laborious and error-prone in practice. Here are my unpleasant facts of clustering:
The majority of current intrusion detection, anti-virus and anti-malware products builds on the concept of misuse detection, that is malicious activity is identified using rules and signatures of known misuse patterns. Consequently, much effort has been devoted to defining expressive and discriminative features for construction of misuse rules. In the domain of network security such features are often derived from the syntax of application layer protocols, for example to answer the questions "who did what to which data when?"
In contrast to misuse detection, a second strain of intrusion detection research investigates methods for anomaly detection, that is malicious activity is identified in terms of unusual and abnormal events. Surprisingly, the use of features from protocol syntax in this setting has been considered in only few research, for example Kruegel et al. and Ingham et al. Access to syntactical features is clearly beneficial for anomaly detection, and hence some of my colleagues (Patrick and Christian) have been working on extending previous approaches to incorporate protocol syntax into a generic anomaly detection framework.
Recent results of this work are presented at the International Conference on Information Systems Security. In particular, we introduce new features for network intrusion detection, which combine sequential features (such as byte frequencies and n-grams) with syntactical tokens. Here is the abstract of our contribution:
Incorporation of Application Layer Protocol Syntax into Anomaly Detection. Patrick Düssel, Christian Gehl, Pavel Laskov and Konrad Rieck. Proc. of International Conference on Information Systems Security (ICISS), December 2008.
In contrast to misuse detection, a second strain of intrusion detection research investigates methods for anomaly detection, that is malicious activity is identified in terms of unusual and abnormal events. Surprisingly, the use of features from protocol syntax in this setting has been considered in only few research, for example Kruegel et al. and Ingham et al. Access to syntactical features is clearly beneficial for anomaly detection, and hence some of my colleagues (Patrick and Christian) have been working on extending previous approaches to incorporate protocol syntax into a generic anomaly detection framework.
Recent results of this work are presented at the International Conference on Information Systems Security. In particular, we introduce new features for network intrusion detection, which combine sequential features (such as byte frequencies and n-grams) with syntactical tokens. Here is the abstract of our contribution:
The syntax of application layer protocols carries valuable information for network intrusion detection. Hence, the majority of modern IDS perform some form of protocol analysis to refine their signatures with application layer context. Protocol analysis, however, has been mainly used for misuse detection, which limits its application for the detection of unknown and novel attacks. In this contribution we address the issue of incorporating application layer context into anomaly-based intrusion detection. We extend a payload-based anomaly detection method by incorporating structural information obtained from a protocol analyzer. The basis for our extension is computation of similarity between attributed tokens derived from a protocol grammar. The enhanced anomaly detection method is evaluated in experiments on detection of web attacks, yielding an improvement of detection accuracy of 49%. While byte-level anomaly detection is sufficient for detection of buffer overflow attacks, identification of recent attacks such as SQL and PHP code injection strongly depends on the availability of application layer context.
Incorporation of Application Layer Protocol Syntax into Anomaly Detection. Patrick Düssel, Christian Gehl, Pavel Laskov and Konrad Rieck. Proc. of International Conference on Information Systems Security (ICISS), December 2008.
I am currently finishing my doctoral thesis, thus there is almost no time for interesting activities and fun. Fortunately, I am not the only one, see for instance here.
Besides all the work, the good news is that we will present an interesting paper on combining anomaly detection and intrusion prevention at the European Conference on Computer Network Defense (EC2ND). Here is the abstract from our contribution:
An Architecture for Inline Anomaly Detection. Tammo Krueger, Christian Gehl, Konrad Rieck and Pavel Laskov. Proc. of European Conference on Computer Network Defense (EC2ND), December 2008.
Besides all the work, the good news is that we will present an interesting paper on combining anomaly detection and intrusion prevention at the European Conference on Computer Network Defense (EC2ND). Here is the abstract from our contribution:
In this paper we propose an intrusion prevention system (IPS) which operates inline and is capable to detect unknown attacks using anomaly detection methods. Incorporated in the framework of a packet filter each incoming packet is analyzed and—according to an internal connection state and a computed anomaly score—either delivered to the production system, redirected to a special hardened system or logged to a network sink for later analysis. Run-time measurements of an actual implementation prove that the performance overhead of the system is sufficient for inline processing. Accuracy measurements on real network data yield improvements especially in the number of false positives, which are reduced by a factor of five compared to a plain anomaly detector.
An Architecture for Inline Anomaly Detection. Tammo Krueger, Christian Gehl, Konrad Rieck and Pavel Laskov. Proc. of European Conference on Computer Network Defense (EC2ND), December 2008.
Subscribe to:
Posts (Atom)

