Abstract
High-resolution cervical auscultation (HRCA) has emerged as a promising noninvasive method for instrumental swallowing assessment that uses accelerometry signals recorded from the patient's throat during swallowing. Compared with the traditional gold-standard technique, the videofluoroscopic swallowing study, HRCA reduces radiation exposure risk and is more accessible. Although previous studies have demonstrated the potential of machine learning for HRCA-based assessment of swallowing kinematics, accurate tracking of anatomical landmarks has remained a challenge. In this study, we propose a deep learning multi-task model that addresses this limitation by detecting the displacement of multiple anatomical structures during swallowing. Using transformer encoders as sequential models, the proposed model tracks the displacement of the hyoid bone and the laryngeal base, and it predicts hyolaryngeal approximation (HLA), the distance between the center of the hyoid bone and the laryngeal base, which plays a crucial role in achieving safe and efficient swallows. For hyoid bone tracking, our model achieved an average relative overlapping (ROP) area above 85%, outperforming the state of the art by more than 30%. The proposed model also tracks the laryngeal base with an average ROP above 80% and predicts the HLA distance in all frames with an average accuracy above 95%. These results highlight the potential of our approach for encoding spatial information and the benefit of multi-task learning for tracking correlated structures. Our findings show significant promise for integrating HRCA into swallowing assessment protocols and mark a substantial advance towards a noninvasive and comprehensive swallowing assessment method.
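As an illustration of the HLA quantity defined above, the following minimal sketch computes a per-frame distance between two tracked structures. It assumes, purely for illustration, that each structure is represented in every frame by a bounding box `(x_min, y_min, x_max, y_max)` and that the box centers stand in for the hyoid bone and the laryngeal base; the function names `box_center`, `hla_distance`, and `hla_series` are hypothetical and are not taken from the proposed model.

```python
import math


def box_center(box):
    """Return the (x, y) center of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def hla_distance(hyoid_box, larynx_box):
    """Euclidean distance between the two box centers in one frame.

    Assumption: box centers approximate the hyoid bone center and the
    laryngeal base; units are whatever the box coordinates use (e.g. pixels).
    """
    hx, hy = box_center(hyoid_box)
    lx, ly = box_center(larynx_box)
    return math.hypot(hx - lx, hy - ly)


def hla_series(hyoid_boxes, larynx_boxes):
    """Per-frame distances for two equally long sequences of boxes."""
    if len(hyoid_boxes) != len(larynx_boxes):
        raise ValueError("both sequences must cover the same frames")
    return [hla_distance(h, l) for h, l in zip(hyoid_boxes, larynx_boxes)]
```

A shrinking value across frames in such a series would correspond to the hyoid bone and the laryngeal base approaching each other during the swallow.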