You are here: Home Research String Mining

String Mining

In many applications, e.g., in computational biology, the goal is to find interesting string or sequence patterns in data. Application areas are, among others, finding discriminative features for sequence classification or segmentation, discovering new binding motifs of transcription factors, or probe design. We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings and strings that pass other statistical tests, e.g., the chi-square test. In contrast to the presented result for strings, no optimal algorithms are known for other pattern domains such as itemsets.
The key to our approach are several recent results on index structures for strings, among them suffix- and lcp-arrays, and a new preprocessing scheme for range minimum queries. The advantages of array-based data structures (compared with dynamic data structures such as trees) are good locality behavior and extensibility to secondary memory. We tested our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.

Publications

Fischer, J, Heun, V, and Kramer, S (2005).
Fast Frequent String Mining Using Suffix Arrays
In: ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 609-612, Washington, DC, USA, IEEE Computer Society Press.

Fischer, J, Heun, V, and Kramer, S (2006).
Optimal String Mining Under Frequency Constraints
In: Proc. of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006), pp. 139-150.

Software

See the C++ source package provided online by Johannes Fischer. Please cite [FHK06] if you are using the software in a publication.