Abstract
Accurately predicting novel compound-protein interactions (CPIs) is essential for accelerating drug discovery. The generalizability of machine learning-based CPI prediction models depends heavily on the availability and diversity of CPI datasets. To maximize data utility, particularly for highly confidential datasets held by industry, federated learning (FL), which integrates multi-site data while preserving privacy, has emerged as a promising approach. However, its effectiveness on heterogeneous data from diverse molecular domains, a common real-world scenario, remains unclear, which limits its broader adoption. This study evaluates FL for CPI prediction using datasets spanning multiple chemical and protein domains. Under data heterogeneity, the FL model improved out-of-domain prediction performance but was outperformed by local models on in-domain data. Drawing on these findings, a new strategy was developed to achieve robust performance on both in-domain and out-of-domain tasks: a similarity-guided ensemble (SGE) that combines the global FL model with models fine-tuned on each client's local data. This method proved effective on real-world data, including samples from a public database and from 13 pharmaceutical companies. Together, these findings offer practical guidance for implementing FL in contemporary drug discovery processes.

SCIENTIFIC CONTRIBUTION: This study identifies the performance trade-offs caused by heterogeneous data distributions in FL for CPI prediction. To overcome these challenges, we developed a workflow integrating local fine-tuning and an SGE, ensuring robust accuracy for both in-domain and out-of-domain predictions. The effectiveness of this approach was validated using both public datasets and real-world in-house datasets from 13 pharmaceutical companies.
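
The abstract does not specify how similarity guides the ensemble, so the following is only an illustrative sketch of one way an SGE could blend the two models. It assumes, hypothetically, that similarity is the maximum Tanimoto similarity between a query compound's fingerprint and the client's local training compounds, and that this value is used directly as the weight on the fine-tuned local model. The function names (`tanimoto`, `max_similarity`, `sge_predict`) and the linear weighting are assumptions made for illustration, not the authors' method.

```python
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_similarity(query: Set[int], local_fps: Sequence[Set[int]]) -> float:
    """Highest similarity between the query and any local training compound."""
    return max((tanimoto(query, fp) for fp in local_fps), default=0.0)


def sge_predict(p_global: float, p_local: float, similarity: float) -> float:
    """Blend two predictions (assumed linear rule, not taken from the paper).

    The weight on the fine-tuned local model grows with similarity to the
    client's own data; the global FL model dominates for dissimilar queries.
    """
    w = min(max(similarity, 0.0), 1.0)  # clamp the weight to [0, 1]
    return w * p_local + (1.0 - w) * p_global


# Hypothetical fingerprints, written as sets of "on" bit indices.
local_fps = [{1, 2, 3, 4}, {2, 3, 5}]
in_domain_query = {1, 2, 3, 4}   # identical to a local training compound
out_domain_query = {7, 8, 9}     # shares no bits with the local data

s_in = max_similarity(in_domain_query, local_fps)
s_out = max_similarity(out_domain_query, local_fps)

# Placeholder model outputs: global FL model 0.2, fine-tuned local model 0.8.
pred_in = sge_predict(0.2, 0.8, s_in)    # follows the local model
pred_out = sge_predict(0.2, 0.8, s_out)  # follows the global model
```

Under these assumptions, an in-domain query is scored mostly by the fine-tuned model and an out-of-domain query mostly by the global FL model, which mirrors the trade-off the abstract reports between local and federated models.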