Abstract
Cryogenic electron microscopy (cryoEM) has revolutionized structural biology by enabling atomic-resolution visualization of biomacromolecules. As artificial intelligence (AI) plays an increasing role in newly developed cryoEM tools, task-specific datasets have become essential. Yet assembling such datasets often demands considerable effort and domain expertise, which constrains the development of AI-driven cryoEM tools. Here, we present CryoDataBot, an automated pipeline that addresses this gap. CryoDataBot streamlines data retrieval, preprocessing, and labeling, with fine-grained quality control and flexible customization, enabling efficient generation of robust datasets. We demonstrate CryoDataBot's effectiveness through improved training efficiency in U-Net models and rapid, effective retraining of CryoREAD, a widely used RNA modeling tool. By simplifying the workflow and offering customizable quality control, CryoDataBot lets researchers tailor dataset construction to the specific objectives of their models while ensuring high data quality and reducing manual workload. This flexibility supports tool development for a wide range of applications in AI-driven structural biology.