Abstract
BACKGROUND: Cancer information is now spread across diverse online platforms. However, traditional methods such as manual coding cannot efficiently identify information sources in large-scale datasets. This study introduces a novel approach that employs prompt engineering to automatically and thematically classify sources of online information on cervical cancer. METHODS: We identified 1,877 Korean online communities (referred to as “cafés”) that provide cervical cancer information. An initial codebook was developed using a zero-shot approach with GPT-4o. Two human coders then reviewed a sample of 500 cafés and iteratively added categories until reaching theoretical saturation, thereby refining the initial codebook. To validate the finalized codebook, which consisted of twelve categories, two coders independently coded a separate sample of 200 cafés (Cohen's kappa = 0.82; 95% CI [0.76-0.88]). We then structured a prompt for the automated classification of the full dataset. RESULTS: The prompts followed a step-by-step structure consisting of (1) main keywords for classification and (2) specific instructions. An initial prompt was applied to GPT-4o and demonstrated acceptable agreement with human coders (Krippendorff's α = 0.84; 95% CI [0.82-0.86] for the full dataset). The finalized prompt, which contained additional detailed instructions, was applied to GPT-4o and Gemini 1.5 Pro. The results demonstrated substantial agreement among human coders, GPT-4o, and Gemini 1.5 Pro (Krippendorff's α = 0.81; 95% CI [0.80-0.83]). CONCLUSIONS: This study highlights the potential of human-AI collaboration in large-scale thematic classification. By integrating the efficiency of AI with human oversight, the proposed approach enhances both methodological validity and interpretive reliability. It offers a scalable pathway for future research in public health, infodemiology, and health communication.
KEY MESSAGES: • Generative AI models (GPT-4o and Gemini 1.5 Pro) can reliably replicate human coding judgments in multi-category classification tasks when guided by a structured prompt. • Human–AI collaboration effectively supports the identification of key information sources in cancer infodemiology by combining AI efficiency with human oversight.
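For readers unfamiliar with the inter-coder statistic reported in METHODS, the following is a minimal sketch of how Cohen's kappa can be computed for two coders who each assign one category per café. It is illustrative only: the abstract does not state which software the authors used, and the function name `cohens_kappa` and the category labels in the usage example are hypothetical, not taken from the study's codebook.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders assigning one nominal label per item.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("Both coders must label the same non-empty set of items.")

    n = len(coder_a)

    # Observed agreement: share of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: for each category, the product of the two coders'
    # marginal proportions, summed over categories.
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical category for every item.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Toy usage with made-up labels (not the study's twelve categories).
labels_a = ["hospital", "patient", "hospital", "commercial", "patient", "hospital"]
labels_b = ["hospital", "patient", "commercial", "commercial", "patient", "hospital"]
kappa = cohens_kappa(labels_a, labels_b)
print(round(kappa, 2))
```

In this toy example the coders agree on five of six items (observed agreement 5/6) while chance agreement is 1/3, giving a kappa of 0.75. Krippendorff's α, which the abstract uses for comparisons involving more than two raters, generalizes the same observed-versus-expected idea and is not covered by this sketch.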