Abstract
BACKGROUND: Cholangiocarcinoma (CCA) is a critical public health problem in Thailand, where its prevalence is much higher than in other parts of the world. Data about CCA are held in different sources and follow different standards, spanning both research datasets and electronic health records (EHRs). OBJECTIVE: This study aims to integrate and analyze CCA data from various sources to investigate risk factors and develop prediction models using the Cholangiocarcinoma Ontology (CCAO). METHODS: Datasets from Thailand were annotated with CCAO and analyzed using ontology term enrichment analysis, similar to that used with the Gene Ontology, to identify significant risk factors for suspected CCA and for patients with CCA. Our program produced a list of significant terms associated with CCA and a visualization of the ontology hierarchy with the significant terms highlighted. The outputs of the term enrichment analyses were then used as inputs to machine learning classification tasks. RESULTS: The results confirmed that indicators for CCA include dilated bile ducts, periductal fibrosis, and hepatic mass, based on ultrasound findings from several years prior. Our analysis also revealed demographic and lifestyle risk factors such as male gender, having no education, alcohol consumption, smoking, being a farmer, and having diabetes. We seeded a random forest classifier with the term enrichment results and predicted CCA patients with an average precision-recall curve score of 0.92 (standard deviation 0.023), with age, dilated bile ducts, periductal fibrosis, suspected CCA, and hepatic mass as the five most important features. CONCLUSIONS: These findings can be used to focus on and monitor populations at risk for CCA. Expanding CCAO with molecular data related to CCA, and applying ontology-driven term enrichment analysis and machine learning to it, will help us generate new hypotheses for decreasing the morbidity and mortality of CCA in Thailand.
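As a rough illustration of the term enrichment step described in the Methods, the sketch below scores each annotated term with a one-sided hypergeometric test, a common choice in Gene Ontology enrichment. The abstract does not state which test the authors used, so the test itself is an assumption. The function names, the toy records, and the case labels are also invented for illustration and are not taken from the study data. Propagating annotations to ancestor terms in the ontology hierarchy and correcting for multiple testing are omitted for brevity.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided p-value P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total records, K: records annotated with the term,
    n: case records, k: case records annotated with the term.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def enrich(records, is_case):
    """Rank terms by how over-represented they are among case records.

    records: list of sets of ontology term labels, one set per patient.
    is_case: list of booleans of the same length (True = case).
    """
    N = len(records)
    n = sum(is_case)
    terms = set().union(*records)
    scores = {}
    for term in terms:
        K = sum(term in r for r in records)
        k = sum(term in r for r, c in zip(records, is_case) if c)
        scores[term] = hypergeom_enrichment_p(k, n, K, N)
    # Smallest p-value first, i.e. most enriched term first.
    return sorted(scores.items(), key=lambda kv: kv[1])


# Toy data (hypothetical, NOT from the study): 4 cases followed by 4 non-cases.
records = [
    {"dilated bile duct", "smoking"},
    {"dilated bile duct", "periductal fibrosis"},
    {"dilated bile duct"},
    {"smoking"},
    {"smoking"},
    set(),
    {"periductal fibrosis"},
    {"smoking"},
]
is_case = [True, True, True, True, False, False, False, False]

ranked = enrich(records, is_case)
for term, p in ranked:
    print(f"{term}: p = {p:.4f}")
```

In a pipeline like the one the abstract describes, terms passing a significance threshold would then serve as candidate input features for a classifier such as a random forest.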