Abstract
Pre-training on extensive unlabeled data and then transferring the learned representations to downstream tasks with limited labeled data has proven effective in many fields. In cervical cytology, however, this paradigm faces extreme data imbalance: positive cells constitute only about 1% of the cells in whole-slide images. In this paper, we propose a pipeline for investigating the impact of this extreme category imbalance on self-supervised representation learning (SSRL). The pipeline consists of two stages: SSRL and downstream tasks. In the SSRL stage, we employ two well-established methods, masked autoencoders (MAE) and the simple framework for contrastive learning (SimCLR), across nine datasets with varying degrees of imbalance. The pre-trained representations are then transferred to downstream tasks using both linear probing and fine-tuning. Additionally, we examine the effect of SSRL on annotation efficiency by varying the annotation budget, i.e., the quantity of labeled data. Our investigation leverages a total of 168,000 image tiles derived from 1,320 whole-slide images obtained from multiple centers. Our findings indicate a noticeable decline in downstream accuracy as the data balance shifts from 1:1 to 1:100, with a maximum drop of about 4%. This highlights the substantial impact of data imbalance on SSRL, which is particularly evident in downstream tasks with low annotation budgets, such as 1%. Furthermore, the downstream tasks can achieve accuracy comparable to that of a high annotation budget (50%) even with a limited annotation budget (5%). The code is available at https://github.com/LGBluesky/ICISSRL.
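The transfer protocols mentioned above differ in what is trained: linear probing freezes the pre-trained encoder and fits only a classifier head, whereas fine-tuning also updates the encoder weights. The following is an illustrative sketch only, not the paper's implementation: the "encoder" is a hypothetical fixed random projection (`W_enc`, `encode` are invented names), and the toy dataset mimics the roughly 1:100 positive-to-negative imbalance described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "encoder": a fixed random projection standing in
# for a pre-trained backbone (MAE or SimCLR in the paper's pipeline).
W_enc = rng.normal(size=(16, 8))

def encode(x):
    # Frozen features: W_enc is never updated during linear probing.
    return np.tanh(x @ W_enc)

# Toy imbalanced dataset: ~1% positives, mirroring the 1:100 setting.
n = 2000
X = rng.normal(size=(n, 16))
y = (rng.random(n) < 0.01).astype(float)
X[y == 1] += 1.5  # shift positives so a linear head can separate them

# Linear probing: train only a logistic-regression head on frozen features.
# (Full fine-tuning would additionally backpropagate into W_enc.)
feats = encode(X)
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= 1.0 * (feats.T @ (p - y) / n)   # gradient of mean logistic loss
    b -= 1.0 * (p - y).mean()

pred = (1.0 / (1.0 + np.exp(-(feats @ w + b)))) > 0.5
acc = (pred == y).mean()
```

Note that under such imbalance, raw accuracy is dominated by the majority class (always predicting "negative" already scores ~99%), which is one reason the abstract's reported accuracy drops of ~4% are substantial.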