Abstract
The integration of sequenced samples and clinical data from independent yet related studies from public domain databases, such as The Sequence Read Archive (SRA), has the potential to increase sample sizes and enhance the statistical power needed for more precise bioinformatic analysis. Data mining and sample grouping are the starting points in this process and still present several challenges, including the presence of structured and unstructured data, missing deposited data, and varying experimental conditions and techniques applied across the studies. Designed to address the main challenges of data mining and sample grouping for biomarkers research, the proposed methodology employs a computational approach integrating relational database construction, text and data mining, natural language processing, network analysis, search by Pubmed publications, and combining MeSH, TTD and WordNet database to identify groups of samples with the same characteristics. As a result, it identifies and illustrates relationships among sample collections, aiming to discover potential cancer biomarkers. In colorectal cancer (CRC) and acute lymphoblastic leukemia (ALL) case studies, this methodology effectively navigates SRA metadata, retrieving, extracting, and integrating data. It highlights significant connections between samples and patient clinical data, revealing important biological insights. The study grouped 2,737 (CRC) and 3,655 (ALL) samples into potential comparison groups, demonstrating the method's power in identifying relationships and aiding biomarker discovery.