Abstract
Recent advances in artificial intelligence (AI) and machine learning (ML) have significantly broadened the range of settings in which these algorithms are deployed. Using these algorithms in high-stakes medical contexts requires appropriate guardrails to ensure reliable performance. One such guardrail is out-of-distribution (OOD) detection, which flags observations that are unlikely to have been drawn from the model's training distribution. Because such observations are rare or absent in the training data, the model's predictions on them are likely to be unreliable. In the medical context, identifying which patients are OOD may improve a model's effective performance by filtering out patients on whom the model has not been adequately trained or tested. Here, we assess the performance of state-of-the-art OOD detection algorithms on three medical datasets spanning imaging, transcriptomics, and time-series modalities. Using a simulated training-deployment scenario, we find that several OOD detectors consistently identify patients on whom the model performs worse. In addition, several OOD detectors identify patient subsets that are underrepresented in the training data, warranting further investigation. Our results suggest that OOD detection methods can help mitigate model risk when deploying medical AI in the real world.