Abstract
Artificial intelligence (AI) has shown considerable potential to advance applications across many medical fields. One such area is the development of integrated AI-based systems to assist in laparoscopic surgery. Surgical tool detection and phase recognition are key components of such systems and have therefore been studied extensively in recent years. Despite significant advances in this field, previous image-based methods still face challenges that limit their performance, owing to complex surgical scenes and limited annotated data. This study proposes a novel deep learning approach for classifying and localizing surgical tools in laparoscopic surgeries. The proposed approach uses a self-supervised learning algorithm for surgical tool classification followed by a weakly supervised algorithm for surgical tool localization, eliminating the need for explicit localization annotations. In particular, we leverage the Bidirectional Encoder representation from Image Transformers (BEiT) model for tool classification and then use the heat maps generated by its multi-headed attention layers to localize these tools. Furthermore, the model incorporates class weights to address the class imbalance arising from the different usage frequencies of surgical tools. Evaluated on the Cholec80 benchmark dataset, the proposed approach demonstrated high performance in surgical tool classification, surpassing previous works that utilize both spatial and temporal information. Additionally, the proposed weakly supervised learning approach achieved state-of-the-art results for the localization task.
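
As a concrete illustration of the attention-based localization summarized above, the following is a minimal sketch, assuming the HuggingFace `transformers` BEiT implementation; the checkpoint name and frame path are placeholders (a Cholec80 fine-tuned classifier is assumed), and this is not the authors' code:

```python
# Sketch: extract [CLS]-to-patch attention from a BEiT classifier as a heat map.
import torch
from PIL import Image
from transformers import BeitImageProcessor, BeitForImageClassification

processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224",  # placeholder; a Cholec80 tool-classification fine-tune is assumed
    output_attentions=True,
)
model.eval()

image = Image.open("frame.png").convert("RGB")        # one laparoscopic video frame
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

tool_logits = outputs.logits                          # per-class tool scores
attn = outputs.attentions[-1]                         # last layer: (1, heads, tokens, tokens)

# Attention from the [CLS] token to the 196 patch tokens, averaged over heads,
# reshaped to the 14x14 patch grid and upsampled to the input resolution.
cls_to_patches = attn[0, :, 0, 1:].mean(dim=0)        # shape (196,)
heatmap = cls_to_patches.reshape(14, 14)
heatmap = torch.nn.functional.interpolate(
    heatmap[None, None], size=(224, 224), mode="bilinear", align_corners=False
)[0, 0]
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
```

In this sketch, thresholding the normalized heat map would yield a coarse tool region without any bounding-box supervision, which is the general idea behind the weakly supervised localization step.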