Abstract
Background: Chest X-ray (CXR) is widely used for the assessment of thoracic diseases, yet automated multi-label interpretation remains challenging due to subtle visual patterns, overlapping anatomical structures, and frequent co-occurrence of abnormalities. While recent deep learning models have shown strong performance, limitations in interpretability, anatomical awareness, and robustness continue to hinder their clinical adoption. Methods: The proposed framework employs a hybrid ConvNeXtV2-Vision Transformer (ViT) architecture that combines convolutional feature extraction for capturing fine-grained local patterns with transformer-based global reasoning to model long-range contextual dependencies. The model is trained exclusively using image-level annotations. In addition to classification, three complementary post hoc components are integrated to enhance model trust and interpretability. A segmentation-aware Gradient-weighted Class Activation Mapping (Grad-CAM) module leverages CheXmask lung and heart segmentations to highlight anatomically relevant regions and quantify predictive evidence inside and outside the lungs. An ontology-driven neuro-symbolic reasoning layer translates Grad-CAM activations into structured, rule-based explanations aligned with clinical concepts such as "basal effusion" and "enlarged cardiac silhouette". Furthermore, a lightweight out-of-distribution (OOD) detection module based on confidence, energy, and Mahalanobis distance scores is employed to identify inputs that deviate from the training distribution. Results: On the VinBigData test set, the model achieved a macro-AUROC of 0.9525 and a micro-AUROC of 0.9777 when trained solely with image-level annotations. External evaluation further demonstrated strong generalisation, yielding macro-AUROC scores of 0.9106 on NIH ChestXray14 and 0.8487 on CheXpert (frontal views).
Both Grad-CAM visualisations and ontology-based reasoning remained coherent on unseen data, while the OOD module successfully flagged non-thoracic images. Conclusions: The proposed approach demonstrates that hybrid convolutional neural network (CNN)-ViT architectures, combined with anatomy-aware explainability and symbolic reasoning, can support automated chest X-ray diagnosis in a manner that is accurate, transparent, and safety-aware.
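
As an illustration of how predictive evidence inside and outside the lungs might be quantified, the following minimal sketch splits the total Grad-CAM activation mass by a binary lung mask. It is not the paper's implementation: the function name `evidence_split`, the clipping of negative activations to zero, and the normalisation by total activation are assumptions made for this example, and the heatmap and mask are assumed to be already resized to the same resolution.

```python
import numpy as np


def evidence_split(cam, lung_mask):
    """Return (inside, outside) fractions of heatmap mass relative to a mask.

    Hypothetical helper: `cam` is a 2D Grad-CAM heatmap and `lung_mask` is a
    2D binary mask (e.g. derived from a lung segmentation) of the same shape.
    """
    # Keep only non-negative evidence (assumption: negatives carry no support).
    cam = np.clip(np.asarray(cam, dtype=float), 0.0, None)
    mask = np.asarray(lung_mask, dtype=bool)
    if cam.shape != mask.shape:
        raise ValueError("heatmap and mask must have the same shape")

    total = cam.sum()
    if total == 0.0:
        # No activation anywhere: report no evidence on either side.
        return 0.0, 0.0

    inside = float(cam[mask].sum() / total)
    return inside, 1.0 - inside


# Toy 2x2 example: the right-hand column is "lung".
toy_cam = np.array([[0.0, 2.0],
                    [1.0, 1.0]])
toy_mask = np.array([[False, True],
                     [False, True]])
inside_frac, outside_frac = evidence_split(toy_cam, toy_mask)
print(inside_frac, outside_frac)
```

A high inside fraction would suggest that the prediction rests mainly on lung regions, whereas a low one could indicate reliance on background or image artefacts; any threshold separating the two would have to be chosen empirically.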