Abstract
Genome assembly aims to reconstruct chromosome-level genome sequences. Scaffolding is a critical step in this process, and its accuracy depends heavily on the quality of the input data. Although both Hi-C and Pore-C are used to study 3D genome structure, Pore-C offers distinct advantages for high-accuracy assembly because it captures long-range information and multi-fragment (multi-way) interactions. However, most current scaffolding methods rely on Hi-C data and inherit the constraints of that technology, which limits assembly continuity and accuracy. We propose PHScaffolding, a scaffolding method based on Pore-C data. The method uses the alignments of Pore-C reads to construct a hypergraph that captures multi-way interactions among contigs, and it introduces a dedicated weighting scheme for the hyperedges. PHScaffolding then applies the Louvain algorithm to cluster the hypergraph, with the aim of grouping contigs that originate from the same chromosome. Finally, within each cluster, the method uses a novel strategy based on Pore-C read alignments to orient and order the contigs, producing chromosome-level scaffolds. Evaluations on HG002, GM12878, and Arabidopsis thaliana contig datasets show that PHScaffolding performs strongly and robustly in terms of NA50, NGA50, and misassembly rate, and comparative experiments show that it outperforms traditional Hi-C-based scaffolding methods. The source code of PHScaffolding is available at https://github.com/Suquana/PHScaffolding.
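To make the hypergraph idea concrete, the following is a minimal illustrative sketch, not the PHScaffolding implementation. It assumes that alignments are available as hypothetical `(read_id, contig)` pairs, treats every read that touches two or more contigs as one hyperedge, and projects the hyperedges onto a weighted contig graph. The `1/(k-1)` weight per contig pair is a generic clique-expansion normalization chosen for illustration; it is not the weighting scheme proposed in this work, and the function names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_hyperedges(alignments):
    """Group contigs by read; a read touching >= 2 contigs yields one hyperedge."""
    contigs_by_read = defaultdict(set)
    for read_id, contig in alignments:
        contigs_by_read[read_id].add(contig)
    return [frozenset(c) for c in contigs_by_read.values() if len(c) >= 2]


def clique_expand(hyperedges):
    """Project hyperedges onto a weighted contig-contig graph.

    Assumption: a hyperedge of size k adds 1/(k-1) to each contig pair it
    contains (a generic normalization, not the paper's weighting scheme).
    """
    weights = defaultdict(float)
    for edge in hyperedges:
        share = 1.0 / (len(edge) - 1)
        for a, b in combinations(sorted(edge), 2):
            weights[(a, b)] += share
    return dict(weights)


# Toy input: hypothetical (read_id, contig) alignment records.
alignments = [
    ("read1", "ctgA"), ("read1", "ctgB"), ("read1", "ctgC"),
    ("read2", "ctgA"), ("read2", "ctgB"),
    ("read3", "ctgD"), ("read3", "ctgE"),
    ("read4", "ctgC"),  # single-contig read: carries no linkage information
]

hyperedges = build_hyperedges(alignments)
graph = clique_expand(hyperedges)
for pair, weight in sorted(graph.items()):
    print(pair, weight)
```

In this toy input, contigs A, B, and C are linked by shared reads, while D and E form a separate group. A weighted graph of this kind could then be passed to a Louvain implementation, for example `louvain_communities` in NetworkX, to obtain candidate chromosome groups.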