Abstract
The underwater acoustic environment is highly complex: signals from diverse natural and anthropogenic sources interact and overlap, making monitoring challenging. Effective detection and classification mechanisms are therefore vital, as they provide key information about marine species and help clarify how human activities affect the marine environment. This study proposes a deep learning-based framework for the automatic detection and classification of marine species vocalizations, inspired by the YOLO (You Only Look Once) architecture. A major obstacle to developing such frameworks is the scarcity of continuous, well-annotated monitoring datasets that contain multi-species recordings. To address this, synthetic monitoring datasets were constructed by combining single-species vocalizations to simulate realistic monitoring conditions under both non-overlapping and overlapping scenarios. Augmentation techniques, including CutMix, were applied to increase dataset diversity and improve the model's robustness to signal overlap. Experimental results show that the proposed model performs strongly under non-overlapping conditions and maintains stable detection and classification even in overlapping scenarios. These findings suggest that YOLO-inspired architectures can perform reliably across varied acoustic conditions. Future work should incorporate continuous, long-term field recordings to further improve detection and classification reliability.
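The abstract does not give implementation details, but a minimal sketch of the two dataset-construction ideas it names (mixing single-species clips into overlapping/non-overlapping scenes, and CutMix-style augmentation on spectrograms) might look as follows. The function names, the sample rate, the additive-mixing assumption, and the patch-pasting formulation are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch only: names, sample rate, and mixing strategy are
# assumptions, not the paper's implementation.
import numpy as np

SR = 32_000  # assumed sample rate in Hz


def mix_clips(background: np.ndarray,
              calls: list[tuple[np.ndarray, float]]) -> np.ndarray:
    """Place single-species calls onto a background track at given offsets.

    Whether the resulting scene is "overlapping" or "non-overlapping" is
    controlled purely by the chosen offsets: if two call windows intersect
    in time, their signals overlap. Because the placements are known, each
    call's time extent can also serve as the ground-truth box for a
    YOLO-style detector operating on the spectrogram.
    """
    out = background.copy()
    for call, offset_s in calls:
        start = int(offset_s * SR)
        if start >= len(out):
            continue  # call would fall entirely outside the clip
        end = min(start + len(call), len(out))
        out[start:end] += call[: end - start]  # simple additive mixing
    return out


def cutmix_spectrograms(spec_a: np.ndarray, spec_b: np.ndarray,
                        rng: np.random.Generator) -> np.ndarray:
    """CutMix-style augmentation: paste a random time-frequency patch
    from one spectrogram into another."""
    n_freq, n_time = spec_a.shape
    patch_f = int(rng.integers(1, n_freq // 2))
    patch_t = int(rng.integers(1, n_time // 2))
    f0 = int(rng.integers(0, n_freq - patch_f))
    t0 = int(rng.integers(0, n_time - patch_t))
    mixed = spec_a.copy()
    mixed[f0:f0 + patch_f, t0:t0 + patch_t] = \
        spec_b[f0:f0 + patch_f, t0:t0 + patch_t]
    return mixed
```

Under these assumptions, a synthetic "monitoring" clip is just a background track with two or more labeled calls mixed in, and CutMix further exposes the model to partial occlusion of time-frequency structure, which is the same failure mode that real signal overlap induces.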