In this post, I present two hacks extending the concept of n-grams that I recently added to my tool Sally. The tool maps a set of strings to a vector space, such that techniques from machine learning can be directly applied to string data. To this end, the strings are characterized by n-grams (substrings of either n bytes or words), where each n-gram is associated with one dimension of the vector space (see the vector space model or bag of words model). If an n-gram occurs in a string, the respective dimension in the vector corresponding to the string is set to 1; otherwise it is 0.
The first hack deals with the position of n-grams in the strings:
Positional n-grams: In some applications the considered strings are aligned and the positions of patterns in the strings play a central role. For instance, in transcription start site detection one tries to identify the start of genes using substrings and their position in the DNA. Positional n-grams address this setting. The n-grams are extracted along with their position in the strings, such that an n-gram matches between two strings only, if occurs at the same position. In particular, Sally maps the strings to a vector space, whose dimensions are associated with both, the n-grams and their positions. For example, the string "abab" contains the following positional 2-grams: "ab@1", "ba@2" and "ab@3".The second hack can add a little robustness to the analysis:
Sorted n-grams: As any data, strings can be noisy. Words are sometimes swapped in text and DNA bases may flip positions due to mutations. Such noise can be slightly compensated by sorted n-grams. After the extraction of an n-gram, the symbols of the n-grams are sorted, thereby removing the local ordering of the data. For example, a 3-gram of the words "john was here" becomes "here john was". This sorting obviously destroys relevant information of the n-grams, yet if the local ordering is perturbed by noise anyway, sorted n-grams are an alternative for analyzing string data. One rather obscure application of this technique is the analysis of Yoda quotes, which are permuted by design.You can find out more about both features in the manual of Sally.
Last but not least, if you are using Sally, I would be interested to learn about where and how Sally is applied to your data. As I have designed the tool mainly for my own needs, I am really looking forward to learn about different applications.