Abstract
CONTEXT: Bone fractures are among the most common musculoskeletal injuries and require timely, accurate diagnosis to ensure effective treatment and to prevent long-term complications. However, manual interpretation of X-ray images is often error-prone, and subtle cases such as hairline fractures are frequently missed.
OBJECTIVE: This study proposes Swin-EffuseNet, a dual-stream deep learning (DL) framework that integrates Swin Transformer V2 and EfficientNet-B0 via attention-based fusion, combining global semantic features with fine-grained local textures for robust fracture classification into four categories: No Fracture, Hairline, Simple, and Complex.
METHOD: A total of 4370 X-ray images were curated from two publicly available datasets: FracAtlas and the Bone Break Classification Dataset. Semantic features extracted by Swin Transformer V2 and texture features extracted by EfficientNet-B0 were combined through attention-based fusion before final classification. An external validation set comprising the Hairline Fracture Detection v2 and Bone Fracture X-ray Simple vs. Comminuted Fractures datasets was employed to assess generalizability.
RESULT: Swin-EffuseNet achieved 92.8 % accuracy, 92.4 % precision, 91.6 % recall, 91.9 % F1-score, a Receiver Operating Characteristic - Area Under the Curve (ROC-AUC) of 0.957, and the lowest log-loss of 0.227. Class-wise accuracies were 91.5 % (No Fracture), 87.9 % (Hairline), 90.4 % (Simple), and 94.6 % (Complex). Statistical testing confirmed significant improvements over Swin Transformer V2 (89.5 %, +3.5 %) and EfficientNet-B0 (88.4 %, +6.7 %). The framework also achieved strong external validation performance, with an inference time of 2.8 ms per image and interpretability supported by Gradient-weighted Class Activation Mapping (Grad-CAM) and t-Distributed Stochastic Neighbour Embedding (t-SNE).
CONCLUSION: Swin-EffuseNet provides an accurate, efficient, and interpretable solution for intelligent fracture classification, supporting scalable deployment in diagnostic workflows.
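The attention-based fusion of the two feature streams can be sketched as below. This is a minimal illustration only, not the paper's implementation: the gating projection `w`, the bias `b`, the feature dimensionality, and the use of a single scalar attention weight per stream are all assumptions for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(semantic, texture, w, b):
    # Concatenate the two streams, score each stream with a (hypothetical)
    # learned projection, and softmax the scores into per-sample attention
    # weights; the fused feature is the weighted sum of the two streams.
    concat = np.concatenate([semantic, texture], axis=-1)  # (batch, 2*d)
    scores = concat @ w + b                                # (batch, 2)
    alpha = softmax(scores, axis=-1)                       # weights sum to 1
    fused = alpha[:, :1] * semantic + alpha[:, 1:] * texture
    return fused, alpha

# Toy example: batch of 2, both streams projected to a common d = 8.
d = 8
semantic = rng.normal(size=(2, d))       # stand-in for Swin Transformer V2 features
texture = rng.normal(size=(2, d))        # stand-in for EfficientNet-B0 features
w = rng.normal(size=(2 * d, 2)) * 0.1    # hypothetical gating weights
b = np.zeros(2)

fused, alpha = attention_fuse(semantic, texture, w, b)
print(fused.shape, alpha.shape)          # (2, 8) (2, 2)
```

In practice both backbones would feed projected feature maps into a trained fusion module; the sketch only shows the gating arithmetic.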