Abstract
Accurate detection of partial discharges (PDs) in medium-voltage overhead transmission lines is critical for preemptive maintenance and avoiding costly outages, yet it is challenged by scarce labeled data and pervasive electromagnetic interference. This paper investigates a hybrid simulation-and-data-driven framework in which synthetically generated PD signals are used to pretrain deep neural networks, which are subsequently fine-tuned on a limited set of real overhead-line measurements. The synthetic pipeline systematically varies PD repetition rates, amplitude distributions, vegetation-contact scenarios, and noise conditions, producing diverse time-series and spectrogram-like representations that approximate real operating environments. We conduct a comprehensive ablation study across multiple architectures, namely Convolutional Neural Networks (CNNs), a Vision Transformer (ViT), and a Long Short-Term Memory (LSTM) network, and analyze their sensitivity to granular sweeps of synthetic-data parameters. CNN-based models decisively outperform their ViT and LSTM counterparts on the spectrogram-based classification task, with the latter two failing to learn meaningful representations. For the successful CNNs, pretraining on carefully parameterized synthetic datasets (particularly those reflecting higher PD activity, such as our Datasets 3 and 4) consistently improves downstream performance on real data, boosting the Matthews Correlation Coefficient (MCC) on imbalanced, cost-sensitive test sets by roughly 10-20% compared with training from scratch. At the same time, we show that poorly aligned synthetic data can degrade generalization, underscoring the need for accurate noise calibration and domain-aligned simulation. Overall, the results confirm that (i) architectural choice is pivotal for PD detection in overhead lines and (ii) well-designed synthetic data is a powerful, practical lever for achieving reliable and cost-effective PD monitoring when real labeled data are limited.