Abstract
BACKGROUND: Extracting temporally sensitive outcomes such as tumor progression from unstructured electronic medical records (EMRs) remains a major challenge in oncology. This study evaluates a domain-adapted natural language processing (NLP) pipeline designed to extract structured, temporally anchored clinical outcomes from narrative EMR data. PATIENTS AND METHODS: Patients with oncogene-addicted advanced or metastatic non-small-cell lung cancer (NSCLC) treated with oral targeted therapies between January 2020 and June 2023 at a French academic hospital were included. Extracted Facts were benchmarked against expert annotations. All outputs were mapped to Observational Medical Outcomes Partnership (OMOP) vocabularies. F1-scores were calculated for correct Concept detection, both without and with the associated Temporality. Real-world progression-free survival (rwPFS) was estimated from the retrieved clinical outcomes. RESULTS: Among 1030 NSCLC patients treated between 2020 and 2023, 112 were confirmed to have advanced or metastatic disease with an oncogenic driver alteration, primarily EGFR (n = 66), ALK (n = 23), and KRAS (n = 16). For tumor evolution concepts, the NLP pipeline achieved an F1-score of 79.7%, falling to 62.0% when temporality was included. Overall performance across all domains reached F1-scores of 76.5% for concept extraction and 63.7% with temporality. Median rwPFS was 21.9 months for EGFR-mutated, 52.4 months for ALK-translocated, and 5.0 months for KRAS-mutant tumors, in line with published benchmarks. Reviewing automatically collected data was 5.8 times faster than manual collection. CONCLUSIONS: Our solution demonstrates robust performance for extracting temporally structured tumor outcomes from EMRs and supports the reconstruction of real-world endpoints in oncology.
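The two F1 variants reported above (concept only, and concept with temporality) can be illustrated with a minimal sketch. The abstract does not specify the matching rules, so everything here is an assumption made for illustration: the `Fact` structure, exact matching on patient, concept, and date, and the concept identifiers, which are hypothetical and not real OMOP concept IDs.

```python
from typing import Iterable, List, NamedTuple, Optional


class Fact(NamedTuple):
    """Hypothetical extracted fact: who, what, and when."""
    patient_id: str
    concept_id: int          # illustrative ID, not a real OMOP concept ID
    date: Optional[str]      # ISO date string, or None when ignored


def f1_score(predicted: Iterable[Fact], reference: Iterable[Fact]) -> float:
    """F1 over exact matches between predicted and reference facts."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def drop_date(facts: Iterable[Fact]) -> List[Fact]:
    """Ignore temporality so only patient and concept must match.

    Note: repeated mentions of one concept for one patient collapse
    into a single fact under this simplification.
    """
    return [fact._replace(date=None) for fact in facts]


# Expert annotations (assumed example data).
reference = [
    Fact("p1", 1001, "2021-03-10"),
    Fact("p1", 1002, "2021-09-02"),
    Fact("p2", 1001, "2022-01-15"),
]

# Pipeline output: all concepts found, but one date is wrong.
predicted = [
    Fact("p1", 1001, "2021-03-10"),
    Fact("p1", 1002, "2021-10-02"),
    Fact("p2", 1001, "2022-01-15"),
]

f1_concept = f1_score(drop_date(predicted), drop_date(reference))
f1_temporal = f1_score(predicted, reference)
print(f"F1 concept only:     {f1_concept:.3f}")
print(f"F1 with temporality: {f1_temporal:.3f}")
```

In this toy example, one misdated fact counts as both a false positive and a false negative under the stricter criterion, which is why the score with temporality is lower than the concept-only score. A real evaluation might instead accept dates within a tolerance window; that choice is not stated in the abstract.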