Abstract
BACKGROUND: Domains can be viewed as portable units of protein structure, folding, function, evolution, and design. Small proteins often consist of a single domain, whereas most large proteins combine multiple domains to carry out composite cellular functions. Dysfunction of a domain may affect protein function in some diseases. Inferring disease-related domains will therefore improve our understanding of the mechanisms of complex human diseases. RESULTS: In this study, we first build a global heterogeneous information network of structure-based domains, proteins, and diseases. We then extract topological features of the network from the meta-paths between domain and disease nodes. Finally, we train a binary classifier based on the XGBoost (eXtreme Gradient Boosting) algorithm to predict potential associations between domains and diseases. The XGBoost-based binary classification model performs significantly better than models using other machine learning algorithms, achieving an AUC (area under the curve) of 0.94 in leave-one-out cross-validation. CONCLUSIONS: We develop a method that builds a binary classifier from meta-path-based topological features and predicts potential associations between domains and diseases. Its predictive performance on independent test sets shows that the method is effective. Moreover, integrating more multi-omic data into the representation of domains and diseases is expected to further improve predictive performance.
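To illustrate what a meta-path-based topological feature can look like, the following minimal sketch counts path instances between a domain node and a disease node along two meta-paths in a toy heterogeneous network. All node identifiers, edges, the choice of meta-paths (domain-protein-disease and domain-protein-protein-disease), and the function names are assumptions made for illustration; the abstract does not specify which meta-paths or feature definitions the study uses. In a full pipeline, vectors of this kind could serve as input to a gradient-boosting classifier such as XGBoost, which is not shown here.

```python
from collections import defaultdict

# Toy, made-up edges (hypothetical identifiers, not data from the study).
DOMAIN_PROTEIN = [("D1", "P1"), ("D1", "P2"), ("D2", "P2"), ("D2", "P3")]
PROTEIN_DISEASE = [("P1", "S1"), ("P2", "S1"), ("P3", "S2")]
PROTEIN_PROTEIN = [("P1", "P3")]  # undirected protein-protein interaction


def adjacency(edges, symmetric=False):
    """Build a node -> set-of-neighbours map from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        if symmetric:
            adj[v].add(u)
    return adj


DP = adjacency(DOMAIN_PROTEIN)
PS = adjacency(PROTEIN_DISEASE)
PP = adjacency(PROTEIN_PROTEIN, symmetric=True)

# Each meta-path is a sequence of edge types to follow from a domain node.
META_PATHS = [
    [DP, PS],      # domain -> protein -> disease
    [DP, PP, PS],  # domain -> protein -> protein -> disease
]


def count_paths(start, hops):
    """Count path instances from `start` to every node reachable via `hops`."""
    frontier = {start: 1}
    for adj in hops:
        nxt = defaultdict(int)
        for node, n_paths in frontier.items():
            for neighbour in adj.get(node, ()):
                nxt[neighbour] += n_paths
        frontier = nxt
    return dict(frontier)


def meta_path_features(domain, disease):
    """One path count per meta-path for a (domain, disease) pair."""
    return [count_paths(domain, hops).get(disease, 0) for hops in META_PATHS]


if __name__ == "__main__":
    for pair in [("D1", "S1"), ("D1", "S2"), ("D2", "S1"), ("D2", "S2")]:
        print(pair, meta_path_features(*pair))
```

In this toy network, the pair (D1, S1) yields the vector [2, 0]: two proteins link D1 directly to S1, and no path of the longer type exists. Raw path counts are only one possible choice; normalised variants are equally plausible.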