Abstract
Deep learning-based automated call detectors offer a solution to the labour- and time-intensive problem of isolating animal sounds within the enormous datasets generated by passive acoustic monitoring (PAM). Broad adoption of deep learning systems in PAM has been hampered by the need for large, labelled training datasets, which do not exist for species whose calls are seldom recorded. Additionally, significant computational resources are required to train many deep learning detectors, making them expensive both monetarily and in terms of energy use. We present an automated detection framework that mitigates these issues for animals that produce stereotyped sounds. First, we produce a semi-synthetic training dataset using a physically motivated data augmentation pipeline that introduces realistic variation into duplicates of a single exemplar recording of the target sound. Second, we fine-tune a pretrained neural network with transfer learning, allowing training on consumer-grade hardware in a matter of hours. Third, we demonstrate our detector on two baleen whale vocalisations and evaluate its performance against ground-truth annotations. Our best-performing model achieved a recall of 99.4%, a precision of 91.2% and an F1 score of 95.1%, matching or outperforming similar detectors despite being trained on a dataset built from a single example of the target call. We propose that our framework advances the utility of deep learning detectors for baleen whales and likely other rare or elusive animals that produce tightly stereotyped vocalisations. The trained model and all associated code are made freely available, with the goal of reducing barriers to the use of deep learning detectors for the study of data-scarce, stereotyped animal sounds.

Supplementary Information: The online version contains supplementary material available at 10.1038/s41598-026-48308-6.