Abstract
Machine learning and artificial intelligence are increasingly applied to medical diagnostics and clinical decision-making. The F1 score and its generalized form, the Fβ score, are widely used to evaluate model performance because they balance precision and sensitivity. However, methods for rigorous statistical inference and power analysis for the F1 and Fβ scores remain limited. In this study, we propose psF1, a unified and comprehensive framework for interval estimation, hypothesis testing, and power and sample size calculation for both single and comparative F1 and Fβ scores. psF1 uses exact probability distributions as well as large-sample approximations to provide valid statistical inference and power analyses. Extensive simulations demonstrate the accuracy and robustness of psF1 across a range of sensitivity, precision, and sample size scenarios. We further demonstrate its practical utility on real-world biomedical classification tasks. The framework enables principled evaluation and comparison of classifiers using F1 and Fβ scores, with reliable uncertainty quantification and informed sample size planning. psF1 is freely available at http://github.com/cyhsuTN/psF1.
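
For readers unfamiliar with the metric, the following minimal Python sketch computes the point estimate of the Fβ score from confusion-matrix counts, using the standard definition Fβ = (1 + β²)·precision·sensitivity / (β²·precision + sensitivity). It is an illustration only and is not part of psF1: the function name `f_beta`, its arguments, and the convention of returning 0 when the denominator is zero are assumptions made here, and the sketch does not reproduce the interval estimation, testing, or power calculations described above.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Point estimate of the F-beta score from confusion-matrix counts.

    tp, fp, fn: true positives, false positives, false negatives.
    beta: weight on sensitivity relative to precision (beta = 1 gives F1).
    Hypothetical helper for illustration; not the psF1 API.
    """
    if min(tp, fp, fn) < 0:
        raise ValueError("counts must be non-negative")
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta * beta
    # Algebraically equal to (1 + b2) * P * R / (b2 * P + R),
    # with P = tp / (tp + fp) and R = tp / (tp + fn).
    denom = (1 + b2) * tp + b2 * fn + fp
    if denom == 0:
        # Assumed convention: score is 0 when there are no positives at all.
        return 0.0
    return (1 + b2) * tp / denom


if __name__ == "__main__":
    # Example counts: 80 true positives, 20 false positives, 10 false negatives.
    print(f_beta(80, 20, 10))           # F1
    print(f_beta(80, 20, 10, beta=2))   # F2, weighting sensitivity more heavily
```

With β > 1 the score weights sensitivity more heavily than precision, and with β < 1 it weights precision more heavily.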