Abstract
OBJECTIVE: Early and accurate detection of ear diseases is essential for preventing hearing impairment and improving population health. This study aimed to develop a lightweight, high-performance, real-time deep learning model for otoscopic image classification and to deploy it in a cross-platform diagnostic system for clinical and community use.

METHODS: We constructed a large-scale, multi-center otoscopy dataset covering eight common ear diseases and healthy ears. On this dataset, we developed Best-EarNet, an ultrafast lightweight architecture that integrates a local-global spatial feature fusion module with a multi-scale supervision strategy to enhance feature representation, and we applied transfer learning to further optimize performance. The model was evaluated on internal (22,581 images) and external (1,652 images) test sets, with subgroup analyses by age and gender, and Grad-CAM visualizations were generated to improve interpretability. A cross-platform intelligent diagnostic system, Ear-Keeper, was then developed for deployment on smartphones, tablets, and personal computers.

RESULTS: Best-EarNet achieved accuracies of 95.23% on the internal test set and 92.14% on the external test set, with a model size of only 2.94 MB, and processed images at 80 frames per second on a standard CPU. Subgroup analyses demonstrated consistently high performance across age and gender groups. Grad-CAM visualizations highlighted lesion-related regions, and Ear-Keeper enabled real-time video-based ear screening across multiple platforms.

CONCLUSION: Best-EarNet offers an accurate, efficient, and interpretable solution for ear disease classification. Its real-time performance and cross-platform deployment through Ear-Keeper support both clinical practice and community-level screening, with strong potential for early detection and intervention.
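The local-global spatial feature fusion idea mentioned above can be illustrated with a minimal, framework-free sketch. This is an assumption-laden toy, not the actual Best-EarNet module (whose internals are not specified in the abstract): the hypothetical `fuse_local_global` function simply enriches each local spatial feature with a globally average-pooled context vector, one common way such fusion is realized.

```python
def fuse_local_global(feature_map):
    """Illustrative local-global fusion (hypothetical, not Best-EarNet's
    actual module): add the globally average-pooled feature vector
    (global context) onto every spatial position (local features).

    feature_map: H x W grid of C-dimensional feature vectors,
    represented as nested Python lists.
    """
    h = len(feature_map)
    w = len(feature_map[0])
    c = len(feature_map[0][0])
    # Global branch: average-pool each channel over all spatial positions.
    global_vec = [
        sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]
    # Fusion: each local feature vector is combined with the global context.
    return [
        [[feature_map[i][j][k] + global_vec[k] for k in range(c)]
         for j in range(w)]
        for i in range(h)
    ]

# Tiny 2x2 feature map with 2 channels.
fm = [[[1.0, 2.0], [3.0, 4.0]],
      [[5.0, 6.0], [7.0, 8.0]]]
fused = fuse_local_global(fm)  # global context here is [4.0, 5.0]
```

In a real network the two branches would be learned (e.g., convolutions for the local path and attention or pooling for the global path) and the combination could be concatenation or gating rather than addition; the sketch only shows the data flow.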