Abstract
Background: Early detection of Alzheimer's disease (AD) is critical for timely intervention, but clinical assessments and neuroimaging are often costly and resource intensive. Natural language processing (NLP) of clinical records offers a scalable alternative, and integrating geolocation may capture complementary environmental risk signals. Methods: We propose the DMV (Data processing, Model training, Validation) framework that frames early AD detection as a regression task predicting a continuous risk score ("data_value") from clinical text and structured features. We evaluated embeddings from Llama3-70B, GPT-4o (via text-embedding-ada-002), and GPT-5 (text-embedding-3-large) combined with a Random Forest regressor on a CDC-derived dataset (≈284 k records). Models were trained and assessed using 10-fold cross-validation. Performance metrics included Mean Squared Error (MSE), Mean Absolute Error (MAE), and R(2); paired t-tests and Wilcoxon signed-rank tests assessed statistical significance. Results: Including geolocation (latitude and longitude) consistently improved performance across models. For the Random Forest baseline, MSE decreased by 48.6% when geolocation was added. Embedding-based models showed larger gains; GPT-5 with geolocation achieved the best results (MSE = 14.0339, MAE = 2.3715, R(2) = 0.9783), and the reduction in error from adding geolocation was statistically significant (p < 0.001, paired tests). Conclusions: Combining high-quality text embeddings with patient geolocation yields substantial and statistically significant improvements in AD risk estimation. Incorporating spatial context alongside clinical text may help clinicians account for environmental and regional risk factors and improve early detection in scalable, data-driven workflows.