Abstract
Urban bus accidents present major safety and operational challenges, particularly in densely populated metropolitan areas. This study develops a machine learning-based analytical framework to identify, quantify, and interpret the factors associated with severe bus accidents. The framework integrates three components: (i) a structural topic model (STM) to extract latent accident scenarios from unstructured narrative data, (ii) an extreme gradient boosting (XGBoost) classifier to predict accident severity, and (iii) SHapley Additive exPlanations (SHAP) for post hoc interpretation of model outputs at both global and local levels. Results from more than 15,000 bus accident records (2013–2018) from a Tier-2 city in Jiangsu Province, China, show that incorporating text-derived accident patterns markedly improves both predictive accuracy and interpretability. The analysis highlights elevated risks linked to rear-end collisions involving electric scooters, sudden stops leading to passenger injuries, and left-turn maneuvers in congested areas. SHAP-based explanations yield actionable insights for drivers, transit operators, and policymakers, supporting targeted safety interventions. Methodologically, this study advances interpretable risk modeling by integrating structured and unstructured data, and its modular framework provides a transferable foundation for other applications in transportation and risk analysis.
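The local explanations in component (iii) rest on Shapley values, which split one record's predicted score among its features. The following is a minimal sketch of that idea only, not the study's pipeline: the scoring function `toy_severity_score`, the feature names, and all coefficients are hypothetical stand-ins for a fitted XGBoost classifier, and the attribution is computed by brute-force enumeration rather than with the SHAP library.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one record.

    Features outside a coalition are replaced by their baseline value.
    Enumerates every coalition, so it is only practical for a few features.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                without = {f: (x[f] if f in coalition else baseline[f])
                           for f in names}
                with_feature = dict(without)
                with_feature[name] = x[name]
                total += weight * (model(with_feature) - model(without))
        phi[name] = total
    return phi


def toy_severity_score(f):
    """Hypothetical severity score (assumption, not the paper's model)."""
    score = 0.1
    score += 0.3 * f["topic_rear_end_scooter"]  # text-derived topic proportion
    score += 0.2 * f["left_turn"]               # structured indicator
    # Interaction between a text-derived and a structured feature
    score += 0.4 * f["topic_rear_end_scooter"] * f["peak_hour"]
    return score


# One illustrative record and an all-zero reference point (both invented)
record = {"topic_rear_end_scooter": 0.8, "left_turn": 1, "peak_hour": 1}
baseline = {"topic_rear_end_scooter": 0.0, "left_turn": 0, "peak_hour": 0}

phi = shapley_values(toy_severity_score, record, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.2f}")
```

In this toy case the attributions sum to the difference between the record's score and the baseline score, and the interaction term is shared equally between the topic feature and the peak-hour indicator. This illustrates how a text-derived pattern can receive credit alongside structured variables in a local explanation.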