Abstract
Penalized regression methods that shrink model coefficients are popular approaches for improving prediction and for variable selection in high-dimensional settings. We present a penalized (or regularized) regression approach for multinomial logistic models for categorical outcomes with a novel adaptive L1-type penalty term that incorporates weights based on the intra- and inter-outcome-category distances of each predictor. A predictor with large between- and small within-outcome-category distances is penalized less and is therefore more likely to be selected for the final model. We propose and study three measures for weight calculation: an analysis of variance (ANOVA)-based measure and two indices used in clustering approaches. Our novel approach, which we term the discriminative power lasso (DP-lasso), thus combines elements of marginal screening with regularized regression methods. We studied the performance of DP-lasso and other published methods in simulations with varying numbers of outcome categories, numbers of predictors, strengths of association and predictor correlation structures. For correlated predictors, the DP-lasso approach with ANOVA-based weights (DPan) resulted in much sparser models than other regularization approaches, especially in high-dimensional settings. When the number p of (correlated) predictors was much larger than the available sample size N, DPan had the highest true positive rate while maintaining low false positive rates across all simulation settings. Similarly, when p < N, DPan had high true positive rates and the lowest false positive rates of all methods studied. We therefore recommend DPan for analysing categorical outcomes in relation to high-dimensional predictors. We further illustrate all approaches in ultra-high-dimensional settings using several single-cell RNA-sequencing datasets.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00180-025-01635-0.
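To make the weighting idea concrete, the sketch below gives one plausible reading of the DP-lasso construction in Python. It is not the authors' implementation: the weight choice w_j = 1/F_j (F_j being the per-predictor one-way ANOVA F-statistic), the function name dp_lasso_sketch, and the use of the standard adaptive-lasso column-rescaling trick with scikit-learn are all our assumptions for illustration.

```python
# A minimal sketch (not the authors' implementation) of an ANOVA-weighted
# adaptive L1 multinomial logistic regression, assuming penalty weights
# inversely proportional to each predictor's one-way ANOVA F-statistic
# across outcome categories.
import numpy as np
from scipy.stats import f_oneway
from sklearn.linear_model import LogisticRegression


def dp_lasso_sketch(X, y, C=1.0):
    classes = np.unique(y)
    # Per-predictor ANOVA F-statistic: large between-category and small
    # within-category variance -> large F -> small penalty weight.
    F = np.array([f_oneway(*(X[y == k, j] for k in classes)).statistic
                  for j in range(X.shape[1])])
    w = 1.0 / np.maximum(F, 1e-12)      # hypothetical weight choice: w_j = 1/F_j
    # Adaptive-lasso reparameterization: scaling column j by 1/w_j makes a
    # uniform L1 penalty on the scaled problem equivalent to the weighted
    # penalty sum_j w_j * |beta_j| on the original scale.
    Xs = X / w
    model = LogisticRegression(penalty="l1", solver="saga",
                               C=C, max_iter=5000).fit(Xs, y)
    beta = model.coef_ / w              # map coefficients back to original scale
    return beta, model.intercept_


if __name__ == "__main__":
    # Toy example: 3 outcome categories, only predictor 0 is informative.
    rng = np.random.default_rng(0)
    n, p = 300, 50
    y = rng.integers(0, 3, size=n)
    X = rng.normal(size=(n, p))
    X[:, 0] += y
    beta, _ = dp_lasso_sketch(X, y)
    print("selected predictors:", np.flatnonzero(np.abs(beta).sum(axis=0)))
```

Under this reading, a predictor with strong discriminative power (large F_j) receives a small weight w_j, is penalized less, and so is more likely to survive selection, matching the behaviour described above; the actual DP-lasso weights, including the two clustering-based indices, are defined in the paper itself.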