Abstract
BACKGROUND: Electronic patient records are a valuable yet underused data source; they have been explored in research using natural language processing, but not yet within a third-sector organization. OBJECTIVE: This study aimed to apply natural language processing to develop a risk identification tool capable of discerning high and low suicide risk among veterans, using electronic patient records from a United Kingdom-based veteran mental health charity. METHODS: A total of 20,342 notes were extracted for this purpose. To develop the risk tool, 70% of the records formed the training dataset, while the remaining 30% were allocated for testing and evaluation. The classification framework was devised and trained to categorize risk as a binary outcome: 1 indicating high risk and 0 indicating low risk. RESULTS: The efficacy of each classifier model was assessed by comparing its results with those from clinical risk assessments. A logistic regression classifier was found to perform best and was used to develop the final model. This comparison allowed for the calculation of the positive predictive value (mean 0.74, SD 0.059; 95% CI 0.70-0.77), negative predictive value (mean 0.73, SD 0.024; 95% CI 0.72-0.75), sensitivity (mean 0.75, SD 0.017; 95% CI 0.74-0.76), F(1)-score (mean 0.74, SD 0.033; 95% CI 0.72-0.76), and accuracy, which was measured using the Youden index (mean 0.73, SD 0.035; 95% CI 0.71-0.76). CONCLUSIONS: The risk identification tool successfully determined the correct risk category of veterans from a large sample of clinical notes. Future studies should investigate whether this tool can detect more nuanced differences in risk and be generalizable across data sources.