Abstract
We introduce ID-GBA (Information Distance Guilt By Association), a method that expands highly connected sets of nodes using a novel subgraph-extension algorithm based on the guilt-by-association principle and information distance. In this study, we use ID-GBA to expand disease clusters and identify novel disease genes. We first validate its ability to expand sets of related diseases in disease/disease graphs built using Open Targets' gene association scores. We then analyze disease/control gene expression networks and show that ID-GBA recaptures known disease genes in nine disease/control graphs. Compared with existing methods such as Random Walk with Restarts and Personalized PageRank, ID-GBA achieves significantly higher Normalized Discounted Cumulative Gain (NDCG) scores, indicating that it ranks known disease genes more accurately. Additionally, unlike other approaches, which require users to specify either a threshold parameter or a fixed number of nodes to include in the extended subgraph, ID-GBA includes a built-in, automated, data-driven thresholding mechanism. These results establish ID-GBA as a novel open-source tool for uncovering hidden relationships in gene/gene, disease/disease, and other complex networks.
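To make the evaluation metric concrete, the following minimal sketch shows how a Normalized Discounted Cumulative Gain score can be computed for a ranked list of candidate genes. It is an illustration under stated assumptions, not the ID-GBA evaluation code: relevance is assumed to be binary (1 if a gene is a known disease gene, 0 otherwise), the ideal ranking is assumed to be the same list re-sorted by relevance, and the function names `dcg` and `ndcg` as well as the gene labels are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(ranked_genes, known_genes):
    """NDCG of a ranking against a set of known disease genes.

    Assumption: binary relevance, and the ideal ranking is the same
    list of genes re-sorted so that all known genes come first.
    """
    relevances = [1.0 if gene in known_genes else 0.0 for gene in ranked_genes]
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    if ideal_dcg == 0:
        return 0.0  # no known gene appears in the ranking
    return dcg(relevances) / ideal_dcg


# Hypothetical example: two known genes, ranked first and third.
known = {"G1", "G3"}
score = ndcg(["G1", "G2", "G3", "G4"], known)
print(round(score, 4))
```

A method that places known disease genes nearer the top of its ranking receives a score closer to 1, which is why NDCG is suited to comparing rankings produced by different network-propagation methods.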