Abstract
BACKGROUND: Artificial intelligence (AI)-based clinical decision support tools are increasingly developed for diagnostic stewardship, yet their clinical adoption remains limited. A key barrier to implementation is concern about potential patient-level harm, particularly when algorithms recommend withholding diagnostic tests that would otherwise be standard practice. Existing evaluation frameworks emphasize aggregate performance metrics but provide no structured method for assessing the clinical consequences of algorithmic errors prior to implementation.
OBJECTIVE: To develop a protocol for retrospective, patient-level evaluation of potential harms associated with false-negative predictions of an AI tool for blood culture stewardship in the emergency department.
METHODS: We developed a protocol to identify and evaluate cases at potential risk of harm following retrospective application of an AI model predicting blood culture outcomes. False-negative cases, defined as positive blood cultures with a model-predicted probability below 5%, are selected after exclusion of contaminants and of clinical scenarios in which the algorithm would not be applied. Multidisciplinary experts perform case-by-case evaluation using a questionnaire covering three domains (antibiotic management, diagnostic procedures, and patient outcomes), supplemented by overall harm and cost assessments. Inter-observer agreement is quantified, and discrepancies are resolved through expert adjudication.
EXPECTED OUTCOMES AND SIGNIFICANCE: This protocol is designed as a pre-implementation safety assessment to support go/no-go decisions on advancing AI tools into clinical research or practice. By operationalizing patient-level harm assessment using routinely collected data, the framework complements existing AI evaluation standards and addresses a critical gap in diagnostic stewardship.
Although developed for blood culture stewardship, the protocol may be adaptable to other AI-based decision support tools in infectious diseases and beyond.
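The case-selection rule and agreement analysis described in METHODS could be operationalized roughly as follows. This is an illustrative sketch only: the field names (`culture_positive`, `predicted_prob`, `excluded`), the threshold constant, and the choice of Cohen's kappa as the inter-observer agreement statistic are assumptions for demonstration, not specifications from the protocol.

```python
# Illustrative sketch of the selection rule: false negatives are positive
# blood cultures with model-predicted probability below 5%, after exclusion
# of contaminants and out-of-scope clinical scenarios. Field names and the
# threshold constant are hypothetical.
THRESHOLD = 0.05

def select_false_negatives(cases, threshold=THRESHOLD):
    """cases: list of dicts with keys 'culture_positive' (bool),
    'predicted_prob' (float), and 'excluded' (bool, true for contaminants
    or scenarios where the algorithm would not be applied)."""
    return [
        c for c in cases
        if c["culture_positive"]
        and not c["excluded"]
        and c["predicted_prob"] < threshold
    ]

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' binary harm judgments (0/1) --
    one plausible way to quantify inter-observer agreement."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    p1_yes, p2_yes = sum(rater1) / n, sum(rater2) / n
    expected = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```

In this sketch, cases flagged by `select_false_negatives` would proceed to the multidisciplinary questionnaire, and `cohens_kappa` would be computed per questionnaire item before adjudicating discrepancies.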