We consider a component of word statistics known as clumps: given a finite set of words, a clump is a maximal set of overlapping occurrences of these words in a text. Clumps were first studied by Schbath with the aim of counting the number of occurrences of words in random texts. Later work with a similar probabilistic approach used the Chen-Stein approximation for a compound Poisson distribution, in which the number of clumps follows a law close to Poisson. There is presently no combinatorial counterpart to this approach, and we fill that gap here. We also provide a construction for the as-yet unsolved problem of clumps of an arbitrary finite set of words. In contrast with the probabilistic approach, which provides only asymptotic results, the combinatorial method provides exact results that are useful when considering short sequences.
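To make the definition of a clump concrete, the following minimal Python sketch extracts clumps from a given text by brute force. It is an illustration only, not the combinatorial construction of the paper. The function name `find_clumps` is hypothetical, and the sketch assumes that two occurrences overlap when they share at least one text position, so occurrences that merely touch end to start fall into separate clumps.

```python
def find_clumps(text, words):
    """Return clumps as (start, end, count) tuples.

    start/end delimit the clump as a half-open interval of text positions;
    count is the number of word occurrences inside it.
    Assumption: occurrences overlap only if they share a text position.
    """
    # Collect every occurrence of every (non-empty) word as (start, end).
    occurrences = sorted(
        (i, i + len(w))
        for w in set(words)
        if w
        for i in range(len(text) - len(w) + 1)
        if text.startswith(w, i)
    )

    clumps = []
    for start, end in occurrences:
        if clumps and start < clumps[-1][1]:
            # The occurrence shares a position with the current clump: extend it.
            clumps[-1][1] = max(clumps[-1][1], end)
            clumps[-1][2] += 1
        else:
            # No shared position: the previous clump is maximal, start a new one.
            clumps.append([start, end, 1])
    return [tuple(c) for c in clumps]


print(find_clumps("aaabaa", ["aa"]))    # [(0, 3, 2), (4, 6, 1)]
print(find_clumps("abababa", ["aba"]))  # [(0, 7, 3)]
```

In the first call, the two occurrences of `aa` starting at positions 0 and 1 overlap and form one clump, while the occurrence at position 4 forms a clump of its own.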