Abstract
Background: Digital pathology (DP) combined with fluorescence confocal microscopy (FCM) allows rapid tissue assessment while preserving specimens. Artificial intelligence (AI) and large language models (LLMs) may enhance diagnostic workflows, but their role in pediatric surgical pathology is largely unexplored. Methods: We conducted a prospective, single-center study of 20 pediatric surgical cases with ex vivo FCM images acquired intraoperatively. Two commercially available LLMs, GPT-4V (AnPathology-Gpt) and Claude 3.7 Sonnet (AnPathology Project), were tested using structured prompts to generate diagnostic reports, with and without immunohistochemistry (IHC) data when these were available. Outputs were compared against the gold-standard diagnosis made by an experienced pediatric pathologist. Diagnostic performance was evaluated using accuracy, sensitivity, specificity, and Cohen's kappa. A paired sub-analysis was performed for cases with IHC (n = 5), and a sensitivity analysis excluding the IHC cases (n = 15) was also conducted. Results: Across all 20 cases, AnPathology-Gpt achieved 85% accuracy, 100% sensitivity, 86% specificity, and κ = 0.78, while AnPathology Project reached 80% accuracy, 100% sensitivity, 57% specificity, and κ = 0.63. Both models correctly identified all 13 neoplastic cases, with errors limited to non-neoplastic lesions mimicking tumors. In the IHC sub-analysis, accuracy improved from 40% to 80% and sensitivity from 50% to 100% for both models, resolving two false negatives observed in the FCM-only evaluation. The sensitivity analysis excluding IHC cases confirmed the consistency of the results. Conclusions: This pilot study demonstrates that multimodal LLMs can support accurate and rapid diagnosis in pediatric digital pathology. The addition of IHC improves performance in diagnostically complex cases. Larger multicenter studies are needed to validate these findings and to define the role of AI-assisted workflows in pediatric surgical pathology.
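For readers unfamiliar with the four performance measures named in the Methods, the following is a minimal sketch of how accuracy, sensitivity, specificity, and Cohen's kappa can be derived from a binary (neoplastic vs. non-neoplastic) confusion matrix. The function name `diagnostic_metrics` and the example counts are hypothetical and chosen only for illustration; they are not the study's data, and the study's own analysis code and class definitions are not described in the abstract.

```python
import math


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute binary diagnostic metrics from confusion-matrix counts.

    tp/fn: reference-positive cases called positive/negative by the model.
    tn/fp: reference-negative cases called negative/positive by the model.
    (Hypothetical helper for illustration; not the study's analysis code.)
    """
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("At least one case is required.")

    accuracy = (tp + tn) / n
    # Sensitivity and specificity are undefined if a reference class is empty.
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    specificity = tn / (tn + fp) if (tn + fp) > 0 else math.nan

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (
        (p_observed - p_expected) / (1 - p_expected)
        if p_expected < 1
        else math.nan
    )

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Illustrative counts only (NOT taken from the study).
    example = diagnostic_metrics(tp=8, fp=1, fn=2, tn=9)
    for name, value in example.items():
        print(f"{name}: {value:.2f}")
```

With these illustrative counts, the chance-expected agreement is 0.50 and the observed agreement is 0.85, so kappa is (0.85 − 0.50) / (1 − 0.50) = 0.70. Kappa is reported alongside accuracy because it discounts the agreement that would occur by chance alone, which matters in small samples with unbalanced classes.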