SeArcH schemes for Approximate stRing mAtching.

Finding approximate occurrences of a query in a text using a full-text index is a central problem in stringology with many applications, especially in bioinformatics. The recent work has shown significant speed-ups by combining bidirectional indices and employing variations of search schemes. Search schemes partition a query and describe how to search the resulting parts with a given error bound. The performance of search schemes can be approximated by the node count, which represents an upper bound of the number of search steps. Finding optimum search schemes is a difficult combinatorial optimization problem that becomes hard to solve for four and more errors. This paper improves on a few topics important to search scheme based searches: First, we show how search schemes can be used to model previously published approximate search strategies such as suffix filters, 01*0-seeds, or the pigeonhole principle. This unifies these strategies in the search scheme framework, makes them easily comparable and results in novel search schemes that allow for any number of errors. Second, we present a search scheme construction heuristic, which is on par with optimum search schemes and has a better node count than any known search scheme for equal or above four errors. Finally, using the different search schemes, we show that the node count measure is not an ideal performance metric and therefore propose an improved performance metric called the weighted node count, which approximates a search algorithm's run time much more accurately.

期刊：	NAR Genomics and Bioinformatics	影响因子：	2.800
时间：	2025	起止号：	2025 Mar 18; 7(1):lqaf025
doi：	10.1093/nargab/lqaf025

SeArcH schemes for Approximate stRing mAtching.

特别声明