Abstract
Linking cellular states to clinical phenotypes is a major challenge in single-cell analysis. Here, we present single-cell multiple instance learning for sample classification and associated subpopulation discovery (scMILD), a weakly supervised multiple instance learning framework that robustly identifies condition-associated cells using only sample-level labels. After systematically validating scMILD's accuracy in controlled simulations, we applied it to diverse disease datasets and confirmed that it recovers known biological signatures. In COVID-19, a sample-informed analysis of scMILD-identified monocytes revealed a temporal transition from an early antiviral state to a late stress-response state. Furthermore, in a cross-disease application, a model trained on COVID-19 data successfully stratified patients with lupus and distinguished shared inflammatory states from disease-specific ones. scMILD thus provides a validated and versatile strategy for dissecting cellular heterogeneity and bridging single-cell observations with high-level phenotypes.