Abstract
BACKGROUND: The aim of this study was to analyze the reproducibility of radiomics feature extraction across three Image Biomarker Standardization Initiative (IBSI)-compliant platforms, using a digital phantom for benchmarking. The analysis revealed high consistency among commonly implemented features but also highlighted the need to standardize computational algorithms and mathematical definitions for platform-specific features.

METHODS: We selected three widely used radiomics platforms: LIFEx, the Computational Environment for Radiological Research (CERR), and PyRadiomics. Using the IBSI digital phantom, we extracted radiomics features on each platform and benchmarked them against one another. The study design tested each platform's ability to reproduce radiomics features consistently, with statistical analyses to assess variability and agreement among the platforms.

RESULTS: Feature reproducibility varied across the platforms. Some features showed high consistency, whereas others differed significantly, underscoring the need for standardized computational algorithms. Specifically, LIFEx and PyRadiomics performed consistently well across many features, whereas CERR showed greater variability in certain feature categories.

CONCLUSION: These findings highlight the need for harmonized feature calculation methods to enhance the reliability and clinical usefulness of radiomics. We also recommend that future studies incorporate clinical data and establish benchmarking procedures to strengthen the role of radiomics in personalized medicine.
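The benchmarking step described above can be sketched as follows: a minimal, illustrative comparison of per-platform feature values against IBSI reference values for a digital phantom. All feature names, numeric values, and the tolerance threshold below are hypothetical placeholders for illustration, not data or results from this study.

```python
# Sketch of IBSI-style benchmarking: compare each platform's extracted
# feature values to reference values using a relative-deviation tolerance.
# All numbers here are illustrative placeholders, not study data.

IBSI_REFERENCE = {
    "firstorder_Mean": 2.15,
    "glcm_JointEntropy": 2.57,
}

# Hypothetical per-platform outputs for the same digital phantom.
platform_values = {
    "LIFEx":       {"firstorder_Mean": 2.15, "glcm_JointEntropy": 2.57},
    "CERR":        {"firstorder_Mean": 2.15, "glcm_JointEntropy": 2.61},
    "PyRadiomics": {"firstorder_Mean": 2.15, "glcm_JointEntropy": 2.57},
}

def relative_deviation(value: float, reference: float) -> float:
    """Absolute deviation from the reference, scaled by the reference."""
    return abs(value - reference) / abs(reference)

def benchmark(values: dict, reference: dict, tolerance: float = 0.01) -> dict:
    """Flag, per platform and feature, whether the extracted value falls
    within the relative tolerance of the reference value."""
    return {
        platform: {
            feature: relative_deviation(feats[feature], reference[feature]) <= tolerance
            for feature in reference
        }
        for platform, feats in values.items()
    }

report = benchmark(platform_values, IBSI_REFERENCE)
```

In this toy example, a platform whose joint entropy deviates by more than 1% from the reference would be flagged as non-reproducible for that feature; real IBSI benchmarking uses per-feature tolerances published with the phantom reference values.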