Machine Learning for Computer Security

Good and bad times with machine learning and security research

Sally: Embedding Strings in Vector Spaces

I am happy to announce the first mature version of Sally, a small tool for efficiently mapping a set of strings to a vector space. Sally comes in handy for those who regularly need to analyse strings and sequential data, such as sets of text documents, DNA sequences or network payloads. While simple statistics can easily be computed from a bunch of strings, applying more involved analysis techniques can be quite tricky. The vast majority of methods from data mining and machine learning are defined in terms of vectors and are thus not directly applicable to string data.

Sally addresses this problem by mapping a set of strings to a set of high-dimensional vectors. This mapping is referred to as an embedding and makes it possible to apply common machine learning tools, such as Shogun, LibSVM, Weka, Matlab and Octave, to string data. The embedding is carried out by characterizing the content of the strings using an implicit set of features. Each feature is associated with one dimension of the vector space, and its occurrences are reflected in this dimension. Sally supports different types of features, ranging from frequencies of plain bytes to occurrences of sequences of consecutive words (n-grams). Sally makes heavy use of sparse data structures to compute the embedding efficiently and constructs vectors from strings in linear time, irrespective of the selected features.
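
To make the idea of an embedding more concrete, here is a minimal Python sketch of an n-gram embedding. It is not Sally's implementation or interface; the function names `ngrams` and `embed` and the `words` switch are made up for illustration. Each distinct n-gram plays the role of one dimension, and only the n-grams that actually occur are stored, which is what keeps the vectors sparse.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def embed(string, n=2, words=False):
    """Map a string to a sparse vector of the form {n-gram: count}.

    Hypothetical helper for illustration only. With words=False the
    tokens are single characters, with words=True they are
    whitespace-separated words.
    """
    tokens = string.split() if words else list(string)
    return Counter(ngrams(tokens, n))


# Character 2-grams: ('a', 'b') occurs twice, ('b', 'a') once.
char_vector = embed("abab", n=2)

# Word 2-grams: ('to', 'be') occurs twice, all others once.
word_vector = embed("to be or not to be", n=2, words=True)
```

For a fixed n, the sketch makes a single pass over the tokens, so the work grows linearly with the length of the string. The feature space itself is never enumerated, which is one way to read "implicit set of features" above.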

There are many applications for Sally. For instance, I have been using some of the underlying algorithms for network intrusion detection, analysis of malicious software and prevention of drive-by downloads. However, since I have blogged a lot about security recently, I have prepared three examples from different domains. All examples include instructions and data sets for playing with Sally and the respective analysis task:
  • Text categorization: the first example introduces the task of text categorization and shows how Sally can be used to map documents to a vector space. The categorization is then learned on the embedded documents using a Support Vector Machine.

  • Gene finding: this example presents an application of Sally for analysis of DNA sequences. Sally is used to map the DNA sequences to a vector space, where one can effectively discover the start of genes.

  • Language analysis: the third example deals with the analysis of natural languages. We are interested in comparing different languages and learning about their relations. Sally is used to map and compare text documents in a vector space.
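
Once strings live in a vector space, comparing them reduces to ordinary vector arithmetic. The following Python sketch illustrates this for the language example using character 3-grams and cosine similarity. It is an assumption-laden toy and not taken from the Sally examples: the helpers `char_ngrams` and `cosine` are hypothetical, and the three sentences are made-up sample data rather than the data sets linked below.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams; a stand-in for the embedding step."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


# Toy sentences, invented for this sketch.
samples = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german": "der schnelle braune fuchs springt ueber den faulen hund",
    "dutch": "de snelle bruine vos springt over de luie hond",
}
vectors = {name: char_ngrams(text) for name, text in samples.items()}

for a, b in [("english", "german"), ("english", "dutch"), ("german", "dutch")]:
    print(a, b, round(cosine(vectors[a], vectors[b]), 3))
```

With realistic amounts of text per language, pairwise similarities like these can be fed into a clustering or visualisation step to explore how languages relate to each other.
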
Sally can be downloaded at http://www.mlsec.org/sally and all examples are available at http://www.mlsec.org/sally/examples.html.