String Mining
In many applications, e.g., in computational biology, the goal is to
find interesting string or sequence patterns in data. Application areas
are, among others, finding discriminative features for sequence
classification or segmentation, discovering new binding motifs of
transcription factors, or probe design. We propose a new algorithmic
framework that solves frequency-related data mining queries on databases
of strings in optimal time, i.e., in time linear in the input and the
output size. The additional space is linear in the input size. Our
framework can be used to mine frequent strings, emerging strings and
strings that pass other statistical tests, e.g., the chi-square test. In
contrast to the presented result for strings, no optimal algorithms are
known for other pattern domains such as itemsets.
The key to our
approach are several recent results on index structures for strings,
among them suffix- and lcp-arrays, and a new preprocessing scheme for
range minimum queries. The advantages of array-based data structures
(compared with dynamic data structures such as trees) are good locality
behavior and extensibility to secondary memory. We tested our algorithm
on real-world data from computational biology and demonstrate that the
approach also works well in practice.
Publications
Fischer, J, Heun, V, and Kramer, S
(2005).
Fast Frequent String Mining Using Suffix Arrays
In: ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 609-612, Washington, DC, USA, IEEE Computer Society Press.
Fischer, J, Heun, V, and Kramer, S
(2006).
Optimal String Mining Under Frequency Constraints
In: Proc. of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006), pp. 139-150.
Software
See the C++ source package provided online by Johannes Fischer. Please cite [FHK06] if you are using the software in a publication.
