Abstract
This study compared the predictive performance of traditional stone scoring systems with that of a ChatGPT-based large language model in estimating stone-free rates following percutaneous nephrolithotomy. A total of 340 patients who underwent the procedure between 2019 and 2025 were retrospectively analyzed. Preoperative stone complexity was evaluated using four established scoring systems (Guy's Stone Score, the CROES nomogram, the S.T.O.N.E. nephrolithometry score, and the Seoul National University Renal Stone Complexity [S-ReSC] score), and each case was additionally processed through a ChatGPT-based prediction model. The predictions of each method were compared with actual postoperative outcomes using correlation analysis and multivariate regression. The overall stone-free rate was 60.9%. Patients who achieved stone-free status had significantly lower Guy's Stone Score, S.T.O.N.E., and S-ReSC values than those with residual stones (all p < 0.001). In contrast, neither the CROES nomogram (p = 0.19) nor the ChatGPT-based predicted stone-free probability (p = 0.549) differed significantly between the two groups. In univariate analysis, higher Guy's Stone Score, S.T.O.N.E., and S-ReSC values were associated with failure to achieve stone-free status. Multivariate analysis identified Guy's Stone Score and the S.T.O.N.E. score as independent predictors of surgical success. The ChatGPT-based model, by contrast, showed limited predictive performance and failed to provide reliable estimates of stone-free rates in our study. These findings support the continued clinical utility of conventional scoring systems while emphasizing the need for further development and validation of artificial intelligence models. Large language models must be trained on structured clinical datasets and externally validated before they are integrated into surgical decision-making in endourology.