Multivariate Optimization of k for k-Nearest-Neighbor Feature Selection With Dichotomous Outcomes: Complex Associations, Class Imbalance, and Application to RNA-Seq in Major Depressive Disorder



Abstract

Optimization of nearest-neighbor feature selection depends on the number of samples and features, the type of statistical effect, the feature scoring algorithm, and class imbalance. We recently reported a fixed-k for Nearest-neighbor Projected-Distance Regression (NPDR) that addresses each of these parameters except class imbalance. To remedy this, we parameterize our NPDR fixed-k by the minority class size (minority-class-k). We also introduce a class-adaptive fixed-k (hit-miss-k) to improve the performance of Relief-based algorithms on imbalanced data. In addition, we present two optimization methods, a constrained variable-wise optimized k (VWOK) and a fixed-k derived with principal component analysis (kPCA), both of which adapt to class imbalance. Using simulated data, we show that our methods significantly improve feature detection across a variety of nearest-neighbor feature scoring metrics, and we demonstrate superior performance compared with random forest and ridge regression using consensus nested cross-validation (cnCV) for feature selection. We applied cnCV to RNA-Seq expression data from a study of Major Depressive Disorder (MDD), using NPDR with minority-class-k, random forest, and ridge regression for gene importance. Pathway analysis showed that NPDR with minority-class-k alone detected genes with clear relevance to MDD, suggesting that our new fixed-k formula is an effective rule of thumb.
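To make the idea of imbalance-aware nearest-neighbor feature scoring concrete, the sketch below implements a minimal Relief-style scorer for a dichotomous outcome, with k tied to the minority class size. This is an illustration only: the function `minority_class_k` uses a simple half-of-minority-class heuristic that is an assumption for this example, not the paper's actual minority-class-k formula, and `relief_scores` is a generic ReliefF-style scorer rather than NPDR itself.

```python
import numpy as np

def minority_class_k(y, frac=0.5):
    """Imbalance-aware fixed-k: a fraction of the minority class size.
    (Hypothetical heuristic for illustration; not the published formula.)"""
    counts = np.bincount(y)
    m_min = counts[counts > 0].min()
    return max(1, int(frac * (m_min - 1)))

def relief_scores(X, y, k):
    """Minimal ReliefF-style scoring for a binary outcome: features that
    separate each sample's nearest misses from its nearest hits score high."""
    n, p = X.shape
    scores = np.zeros(p)
    for i in range(n):
        d = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to all samples
        d[i] = np.inf                      # exclude self from neighbor search
        hits = np.where(y == y[i])[0]      # same-class candidates
        misses = np.where(y != y[i])[0]    # other-class candidates
        nh = hits[np.argsort(d[hits])[:k]]      # k nearest hits
        nm = misses[np.argsort(d[misses])[:k]]  # k nearest misses
        # reward per-feature separation from misses, penalize spread among hits
        scores += (np.abs(X[nm] - X[i]).mean(axis=0)
                   - np.abs(X[nh] - X[i]).mean(axis=0))
    return scores / n

# Example: 40 controls vs. 10 cases, with a main effect in feature 0 only
rng = np.random.default_rng(0)
y = np.array([0] * 40 + [1] * 10)
X = rng.normal(size=(50, 5))
X[:, 0] += 2.0 * y                         # shift feature 0 in the minority class
k = minority_class_k(y)                    # k bounded by the minority class
scores = relief_scores(X, y, k)
```

Note the design point the abstract emphasizes: with a k chosen relative to the whole sample (e.g., half of n), the 10-sample minority class could not supply enough within-class neighbors, so bounding k by the minority class size keeps the hit neighborhoods well defined.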
