xkcd-repeats: A new taxonomy of repeats defined by their context diversity

Matthias Gallé, Matias Tealdi
Journal of Discrete Algorithms, 48, pp. 1-16

The context in which a substring appears is an important notion to identify – for example – its semantic meaning. However, existing definitions from stringology fail to model the context explicitly. We introduce here xkcd-repeats, a new family of repeats characterized by the number of different symbols at the left and right of their occurrences. These repeats include as special extreme cases the well-known classes of maximal and super-maximal repeats. We give sufficient and necessary condition to bound their number linearly in the size of the original string, and show an optimal algorithm that computes them in linear time – given a suffix array –, independent of the size of the alphabet, as well as two other algorithms that are faster in practice. We extend this in two ways: first we show how to generalize the notion of context. Instead of reducing it to the single symbol before and after an occurrence, we propose the notion of unbounded context, while keeping linear representation and computing time. Secondly, we provide a general framework which allows to compute these (and other) repeats incrementally; opening the door to streaming application.