Abstract
BACKGROUND: Prolonged hospital length of stay (LOS) drains both human and material hospital resources and has deleterious psychological and physiological effects on the patient. Some patients are at higher risk of a prolonged LOS, and it is important to identify them in the first days after admission so that appropriate measures can be implemented as soon as possible. Early identification allows staffing and bed occupancy needs to be planned and appropriate social services to be arranged for discharge. Prolonged LOS depends on a wide range of clinical, demographic, and healthcare parameters, which are available in hospital information systems. We hypothesized that the variables recorded by physicians in the first 2 days of hospitalization reflect their precise knowledge of the patient's condition and could be used to predict prolonged LOS. We aimed to develop a prediction model based on a restricted number of the many parameters available in the first 2 days after admission. METHODS: Deidentified patient administrative and clinical data are stored in our French University Hospital's Clinical Data Warehouse. We present a two-stage predictive modelling experiment conducted on data from 134 840 adult patients with 273 693 hospitalizations between 2016 and 2018. In the first stage, we used conventional clinical variables and composite variables (counts of procedures/medications aggregated into categorical bins to form new variables) as inputs to several machine-learning algorithms and selected the best-performing model. In the second stage, we used the SHAP (SHapley Additive exPlanations) method to identify the most important predictive variables and used these to simplify the predictive model. RESULTS: XGBoost with an undersampling method outperformed the other methods, with an AUC-ROC (area under the receiver operating characteristic curve) of 0.802 (95% CI: 0.801-0.803) and an F2 score of 0.533 (95% CI: 0.533-0.534).
Predictive performance was equivalent when we retained only half of the variables, selected by SHAP value, with an AUC-ROC of 0.804 (95% CI: 0.803-0.805) and an F2 score of 0.536 (95% CI: 0.535-0.536). This consistency held when SHAP-based selection reduced the number of variables by more than 70%, from 523 to 150. CONCLUSION: Although prolonged LOS can be predicted with a large, complex set of variables, most such models are difficult to use in clinical practice. SHAP-value-based variable selection reduced the number of variables while maintaining equivalent predictive performance. This makes prediction of prolonged LOS easier to implement in routine clinical practice by prioritizing certain predictive factors, so that preventive measures can be taken for the patients identified.
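
As an illustration of three ingredients named in the abstract (undersampling, the F2 score, and SHAP-based variable selection), the following is a minimal sketch in plain Python and not the authors' implementation. It assumes random undersampling of the majority class to a 1:1 ratio (the abstract does not specify the undersampling method), binary labels in which 1 marks a prolonged LOS, and a matrix of per-hospitalization SHAP values that has already been computed by some explainer. Ranking variables by mean absolute SHAP value is a common convention assumed here. All function names, variable names, and toy numbers are hypothetical.

```python
import random
from typing import List, Sequence


def fbeta_score(y_true: Sequence[int], y_pred: Sequence[int], beta: float = 2.0) -> float:
    """F-beta score for binary labels; beta=2 (the F2 score) favours recall over precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def random_undersample(labels: Sequence[int], seed: int = 0) -> List[int]:
    """Keep every positive row and an equally sized random sample of negative rows.

    Assumption: positives (prolonged LOS) are the minority class.
    Returns the sorted indices of the rows to keep.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    kept_neg = rng.sample(neg, min(len(pos), len(neg)))
    return sorted(pos + kept_neg)


def top_k_by_mean_abs_shap(
    shap_values: Sequence[Sequence[float]], feature_names: Sequence[str], k: int
) -> List[str]:
    """Rank variables by mean absolute SHAP value across rows and keep the k highest."""
    n_rows = len(shap_values)
    importance = [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(len(feature_names))
    ]
    ranked = sorted(range(len(feature_names)), key=lambda j: importance[j], reverse=True)
    return [feature_names[j] for j in ranked[:k]]


# Toy data, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
f2 = fbeta_score(y_true, y_pred, beta=2.0)

kept_rows = random_undersample(y_true, seed=0)

toy_shap = [
    [0.5, -0.1, 0.0],
    [-0.4, 0.2, 0.05],
    [0.3, -0.3, 0.0],
]
toy_names = ["age", "n_procedures_bin", "ward"]
selected = top_k_by_mean_abs_shap(toy_shap, toy_names, k=2)
```

In a workflow of this shape, the model would be refitted on the selected variables only and its AUC-ROC and F2 score compared with those of the full model, which is the kind of comparison the results above report.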