Abstract
The suffix array, describing the lexicographical order of suffixes of a given text, and the suffix tree, a path-compressed trie of all suffixes, are the two most fundamental data structures for string processing, with plethora of applications in data compression, bioinformatics, and information retrieval. For a length-n text, however, they use Θ(n log n) bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. Sadakane [SODA 2002] then showed how to augment them to obtain the compressed suffix tree (CST). For a length-n text over an alphabet of size σ, these structures use only 𝒪(n log σ) bits. Nowadays, these structures are part of the standard toolbox: modern textbooks spend dozens of pages describing their applications, and they almost completely replaced suffix arrays and suffix trees in space-critical applications. The biggest remaining open question is how efficiently they can be constructed. After two decades, the fastest algorithms still run in 𝒪(n) time [Hon et al., FOCS 2003], which is Θ(logσn) factor away from the lower bound of Ω(n/logσn) (following from the necessity to read the input). In this paper, we make the first in 20 years improvement in n for this problem by proposing a new compressed suffix array and a new compressed suffix tree which admit o(n)-time construction algorithms while matching the space bounds and the query times of the original CSA/CST and the FM-index. More precisely, our structures take 𝒪(n log σ) bits, support SA queries and full suffix tree functionality in 𝒪(logϵn) time per operation, and can be constructed in [Formula: see text] time using 𝒪(n log σ) bits of working space. (For example, if σ = 2, the construction time is [Formula: see text].) We derive this result as a corollary from a much more general reduction: We prove that all parameters of a compressed suffix array/tree (query time, space, construction time, and construction working space) can essentially be reduced to those of a data structure answering new query types that we call prefix rank and prefix selection. Using the novel techniques, we also develop a new index for pattern matching.