Abstract
BACKGROUND: Super pangenomes, built from complete genome sequences at the genus level, have provided new insights into speciation and the evolution of functional genes. Genome size (GS) estimation is a critical first step. Although K-mer-based GS evaluators are widely used to guide genome assembly and quality assessment, their results vary substantially with the tool and parameters used, which poses challenges for genus-level genome studies. RESULTS: Here, we investigated K-mer spectra from datasets of species with and without whole-genome duplication, revealing a trade-off in K-mer length that amplified the signal of genomic characteristics related to repeat content or heterozygosity. Moreover, GS predictions were influenced by genomic heterozygosity and sequencing accuracy when different K-mer lengths were employed. In contrast, all HiFi-based evaluations yielded consistent GS predictions, demonstrating the high accuracy of the limiting values derived from the regions where GS estimates converge as K is varied continuously. Unlike traditional methods that rely on single predictions, we introduced a closed-loop GS-estimation framework that incorporates steady-value calculations and leverages the continuity and accuracy of HiFi reads. Finally, we developed a high-performance pipeline, LVgs (https://github.com/xingjianfeng100/LVgs), by integrating FastK and GenomeScope 2.0. CONCLUSIONS: The robustness and applicability of LVgs for genus-level studies were demonstrated through its application to various diploid and polyploid species. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-025-12031-9.
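The idea of taking a limiting value from the region where GS estimates converge across K can be illustrated with a minimal sketch. This is not the LVgs implementation: the function name `steady_value`, the sliding-window criterion, the `window` and `rel_tol` parameters, and the example numbers are all assumptions made for illustration. The sketch assumes that per-K GS estimates have already been produced by a K-mer-based evaluator.

```python
from statistics import mean


def steady_value(gs_by_k, window=5, rel_tol=0.01):
    """Find the most stable run of GS estimates over consecutive K values.

    gs_by_k : dict mapping K-mer length -> GS estimate (any consistent unit).
    window  : number of consecutive K values that must agree (assumed criterion).
    rel_tol : maximum allowed (max - min) / mean inside the window (assumed).

    Returns (k_start, k_end, steady_gs), or None if no window converges.
    """
    ks = sorted(gs_by_k)
    if len(ks) < window:
        return None

    best = None  # (relative spread, first K, last K, window mean)
    for i in range(len(ks) - window + 1):
        win_ks = ks[i:i + window]
        vals = [gs_by_k[k] for k in win_ks]
        centre = mean(vals)
        spread = (max(vals) - min(vals)) / centre
        if best is None or spread < best[0]:
            best = (spread, win_ks[0], win_ks[-1], centre)

    spread, k_start, k_end, centre = best
    if spread > rel_tol:
        return None  # estimates never settle within the tolerance
    return k_start, k_end, centre


# Synthetic, made-up estimates in Mb; these are not results from the paper.
example = {15: 402.0, 17: 431.0, 19: 452.0, 21: 463.0, 23: 468.0,
           25: 469.0, 27: 469.5, 29: 470.0, 31: 470.2}
result = steady_value(example)
```

Reporting the window bounds alongside the steady value makes it visible which range of K the estimate rests on, and returning `None` when nothing converges avoids presenting an unstable single-K prediction as a final answer.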