Sally addresses this problem by mapping a set of strings to a set of high-dimensional vectors. This mapping is referred to as embedding and allows common machine learning tools to be applied to string data, such as Shogun, LibSVM, Weka, Matlab and Octave. The embedding is carried out by characterizing the content of strings using an implicit set of features. Each feature is associated with one dimension of the vector space, and its occurrences are reflected in this dimension. Sally supports different types of features, ranging from frequencies of plain bytes to occurrences of sequences of consecutive words (n-grams). Sally makes heavy use of sparse data structures to compute the embedding efficiently and can construct vectors from strings in linear time, irrespective of the selected features.
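The idea behind the embedding can be illustrated with a short Python sketch (this is a conceptual toy, not Sally's actual implementation): each n-gram of a string corresponds to one dimension of an implicit vector space, and only the non-zero dimensions are stored, which is what makes the mapping linear in the length of the string.

```python
from collections import Counter

def embed(text, n=3):
    """Map a string to a sparse vector of character n-gram counts.

    Each distinct n-gram is one dimension of an implicit vector
    space; only non-zero dimensions are stored explicitly, so the
    embedding takes time linear in the length of the string.
    """
    grams = (text[i:i + n] for i in range(len(text) - n + 1))
    return Counter(grams)

# A sparse vector: only the n-grams that occur appear as keys.
vec = embed("sally embeds strings")
```

Two strings can then be compared by standard vector operations (dot products, distances) over the keys the two sparse vectors share.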
There are many applications for Sally. For instance, I have been using some of the underlying algorithms for network intrusion detection, analysis of malicious software and prevention of drive-by downloads. However, since I have blogged a lot about security recently, I have prepared three examples from different domains instead. All examples include instructions and data sets for experimenting with Sally on the respective analysis task:
- Text categorization: The first example introduces the task of text categorization and shows how Sally can be used to map documents to a vector space. The categorization is then learned from the embedded documents using a Support Vector Machine.
- Gene finding: The second example presents an application of Sally to the analysis of DNA sequences. Sally is used to map the DNA sequences to a vector space, where one can effectively discover the starts of genes.
- Language analysis: The third example deals with the analysis of natural languages. We are interested in comparing different languages and learning about their relations. Sally is used to map and compare text documents in a vector space.
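The workflow of the first example can be sketched in plain Python: embed documents as sparse word-n-gram vectors, then learn categories in the embedded space. For brevity, this sketch uses a nearest-centroid classifier with cosine similarity as a stand-in for the Support Vector Machine, and made-up toy documents; the embedding step is the same idea Sally implements.

```python
from collections import Counter
import math

def embed(text, n=2):
    # Word n-grams as features; only non-zero dimensions are stored.
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

def cosine(u, v):
    # Cosine similarity of two sparse vectors given as dicts.
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two toy categories with a handful of training strings each
# (hypothetical data, just to make the sketch runnable).
train = {
    "sports": ["the team won the match", "a great match for the team"],
    "tech":   ["the new compiler is fast", "a fast compiler for the language"],
}

# One centroid per category in the embedded vector space.
centroids = {}
for label, docs in train.items():
    c = Counter()
    for d in docs:
        c.update(embed(d))
    centroids[label] = c

def classify(text):
    # Assign a document to the category with the nearest centroid.
    v = embed(text)
    return max(centroids, key=lambda lb: cosine(v, centroids[lb]))
```

With Sally, the embedding step would be done by the tool itself, and the resulting vectors would be passed to a learning package such as LibSVM for the actual categorization.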