Machine Learning for Computer Security

Good and bad times with machine learning and security research

Generalized Suffix Trees and Distinct Substrings.

This is a technical post on strings, substrings and suffix trees. This could get boring. You have been warned.

Everybody knows the concept of longest common substring, where a set of strings is given and the task is to determine the longest substring shared by the set. While intuitive and straightforward to implement using a generalized suffix tree, this setting is not robust against "noise", such as corrupted or truncated strings. Moreover, finding the longest common substring is useless, if the considered strings derive from different data distributions, thus not necessary sharing any substring.

These shortcomings can be addressed by determining a set of shared substrings, which are shared by at least m strings but not necessary all strings in the set. To restrict the amount of small shared substrings, one requires the substrings to have a minimum length. This setting has been applied in the PolyGraph and Hamsa paper for automatic signature generation from network data. Again, an implementation is easily realized using a depth-first traversal of a generalized suffix tree. Unfortunately, many of the shared substrings will be prefixes and suffixes of each other.

This issue is resolved by the set of distinct shared substrings, where a distinct substring x is not contained in any other distinct substring y, unless x appears in at least m strings independently of y. Implementing this setting using generalized suffix trees is a little bit more involved. Together with a colleague, I came up with this approach:
We start to determine interesting substrings by a depth-first traversal of a generalized suffix tree.

This time, however, we are looking for distinct substrings and hence can not afford to output a substring that later turns out to be non-distinct. We thus first visit nodes which give rise to longer substrings, that is for each node we order the child nodes according to the length of possible substrings. Consequently, we find longer substrings first.

If we have determined a first substring x shared by m strings, we can be sure that it is distinct as no longer substrings exist. Next, we need to mark all non-distinct substrings of x. We do this by traversing the suffix links and parent nodes of x, thereby processing all suffixes and prefixes of x. For each node we maintain a counter indicating the number of strings the corresponding substring is shared by. When processing the prefixes and suffixes of x, we decrement this counter by the number of occurences of x.

Finally, we continue the depth-first traversal. We do not descend to nodes with a counter smaller than m, as no distinct shared substring can be determined on the corresponding path.

To be honest, I have not thoroughly thought about the complexity of this algorithm. Yet, I expect it to be "somewhat linear" in the length of the considered strings. The output of the algorithm resembles the concept of distinct substrings proposed in PolyGraph, where the authors apply a costly post-processing of shared substrings. Note that this approach can also be implemented using suffix and LCP arrays with the traversal techniques studied by Kasai et al.

Did I mention that I like string algorithms?