Abstract
Missing data in morphological trait datasets pose a persistent challenge to ecological and evolutionary research, frequently compromising model inference and predictive accuracy. We propose THORBFNN, a three-stage hybrid imputation framework that integrates regularized K-means clustering, Radial Basis Function Neural Networks (RBFNNs), and hierarchical Bayesian optimization to recover missing avian morphological traits. The framework first partitions species into clusters using regularized K-means, which preserves local morphological structure by promoting inter-cluster separation. Within each cluster, RBFNNs model nonlinear dependencies among traits, using input features selected by their Pearson correlation with the target trait. Key hyperparameters, such as the number of clusters and the RBF width, are tuned via hierarchical Bayesian optimization to balance generalization against model complexity. When applied to a global avian trait dataset comprising over 10,000 individuals and 11 morphological traits, THORBFNN outperforms K-nearest neighbors (KNN) and Random Forest imputation across four focal traits, achieving a higher R² and lower errors (THORBFNN: R² = 0.9003, RMSE = 0.1652, MAE = 0.1096; KNN: R² = 0.8864, RMSE = 0.1668, MAE = 0.1248; Random Forest: R² = 0.8573, RMSE = 0.2134, MAE = 0.1584). Ablation experiments comparing models trained on complete cases with models trained on mean-imputed data confirm that THORBFNN captures genuine trait covariation rather than statistical artifacts. THORBFNN requires no phylogenetic information and scales efficiently to datasets with thousands of individuals, offering a practical pathway for integrating machine learning into biodiversity trait analysis.
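To make the within-cluster stage and the reported metrics concrete, the following is a minimal sketch in standard-library Python, not the authors' implementation. It ranks candidate features by absolute Pearson correlation with the target trait, fits a Gaussian RBF network with fixed centers by ridge-regularized least squares, and computes R², RMSE, and MAE. Several choices are assumptions made only for illustration: the top-k selection rule, the hand-picked centers, the width of 0.5, the ridge penalty, the column names `wing_length` and `noise`, and the synthetic data. The clustering and Bayesian optimization stages are omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return sxy / math.sqrt(sxx * syy)

def select_features(columns, target, k):
    """Return the k column names with the largest |r| against the target."""
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    return ranked[:k]

def rbf_design(X, centers, width):
    """Gaussian activations for each row of X, plus a trailing bias term."""
    return [[math.exp(-sum((a - c) ** 2 for a, c in zip(x, ctr)) / (2 * width ** 2))
             for ctr in centers] + [1.0] for x in X]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w

def fit_rbf(X, y, centers, width, ridge=1e-6):
    """Output weights from the ridge normal equations (Phi^T Phi + ridge I) w = Phi^T y."""
    Phi = rbf_design(X, centers, width)
    m = len(Phi[0])
    A = [[sum(row[i] * row[j] for row in Phi) + (ridge if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(row[i] * t for row, t in zip(Phi, y)) for i in range(m)]
    return solve(A, b)

def predict_rbf(X, centers, width, weights):
    """Weighted sum of RBF activations (and bias) for each row of X."""
    return [sum(p * w for p, w in zip(row, weights))
            for row in rbf_design(X, centers, width)]

def metrics(y_true, y_pred):
    """R^2, RMSE, and MAE, the three scores reported in the abstract."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": 1 - ss_res / ss_tot, "rmse": math.sqrt(ss_res / n), "mae": mae}

# Synthetic, hypothetical data: one informative trait and one uninformative column.
xs = [i / 10 for i in range(21)]
columns = {"wing_length": xs, "noise": [(-1.0) ** i for i in range(21)]}
target = [x ** 2 for x in xs]  # nonlinear dependence on the informative trait

selected = select_features(columns, target, k=1)
X = [[columns[name][i] for name in selected] for i in range(len(target))]
centers = [[0.0], [0.5], [1.0], [1.5], [2.0]]  # assumed fixed centers
weights = fit_rbf(X, target, centers, width=0.5)
scores = metrics(target, predict_rbf(X, centers, 0.5, weights))
print(selected, scores)
```

In a fuller pipeline, one such network would be fitted per cluster, and the centers, width, and cluster count would be tuned rather than fixed by hand as they are here. The scores in this sketch are computed on the training points, so they illustrate the metric definitions rather than held-out imputation accuracy.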