Abstract
We developed a computational model, called "Unbiased microRNA-disease association predictor (UBMDA)," to predict microRNA-disease associations. UBMDA differs from previously reported models in two major ways. First, it does not rely on similarity-based feature extraction, which is the main basis of previous studies. Instead, it uses International Classification of Diseases 11th Revision (ICD-11) disease codes and microRNA nucleotide sequences as input features. Thus, UBMDA can be applied to newly discovered or poorly studied microRNAs and diseases. Second, we constructed an appropriate negative sample dataset. A positive sample dataset consisting of microRNA-disease pairs with proven associations is publicly available. However, datasets reporting the absence of an association between a microRNA and a disease are rare. Therefore, a negative sample dataset was created by pairing microRNAs with diseases. Because more commonly studied microRNAs and diseases are more likely to appear in the positive sample dataset, creating a negative sample dataset without accounting for this bias could cause an imbalance in microRNA and disease frequencies between the positive and negative sample datasets, leading to biased prediction. To prevent such an imbalance, we created the negative sample dataset according to the frequency of each microRNA and disease in the positive sample dataset, such that these frequencies were similar between the two datasets. The resulting computational model has a simple and intuitive structure. UBMDA will contribute to accelerating the development of microRNA-related biomarkers and therapeutics.
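The abstract does not specify how the frequency-matched negative sample dataset was drawn. The following Python code is a minimal sketch of one way such sampling could work, not the UBMDA implementation. It assumes that negatives are drawn by rejection sampling, with each microRNA and each disease picked in proportion to its frequency in the positive sample dataset. The function name `sample_matched_negatives` and its parameters are hypothetical.

```python
import random


def sample_matched_negatives(positive_pairs, n_negatives=None, seed=0, max_tries=100_000):
    """Draw negative (microRNA, disease) pairs whose microRNA and disease
    frequencies approximately follow those of the positive pairs.

    Illustrative assumption only: rejection sampling from the positive marginals.
    """
    rng = random.Random(seed)
    positives = sorted(set(positive_pairs))  # deduplicate; sort for reproducibility
    known = set(positives)
    if n_negatives is None:
        n_negatives = len(positives)

    # Each entity appears once per positive pair it belongs to, so a uniform
    # draw from these lists is a draw proportional to positive-set frequency.
    mirnas = [m for m, _ in positives]
    diseases = [d for _, d in positives]

    negatives = set()
    for _ in range(max_tries):
        if len(negatives) >= n_negatives:
            break
        pair = (rng.choice(mirnas), rng.choice(diseases))
        # Reject known associations; the set drops repeated draws.
        if pair not in known:
            negatives.add(pair)
    # May return fewer than n_negatives if max_tries is exhausted.
    return sorted(negatives)
```

Under this sketch, a microRNA that appears in many positive pairs also appears in many negative pairs, so a classifier cannot separate the two classes simply by learning which microRNAs or diseases are well studied. Rejecting known associations and repeated draws distorts the frequencies slightly, so the match is approximate rather than exact.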