Dataset meta-level and statistical features affect machine learning performance

Abstract

Which dataset features affect machine learning (ML) performance has remained largely unexplored in the literature. This study examines how the meta-level and statistical features of tabular datasets affect the performance of various ML algorithms. It considers three meta-level features: the dataset size, the number of attributes, and the ratio of positive (class 1) to negative (class 0) instances. It also considers four statistical features for each dataset: mean, standard deviation, skewness and kurtosis. After applying the required scaling, we aggregated each statistic across a dataset's attributes, using both uniform and weighted averages, to quantify its four statistical features. We analysed 200 open-access tabular datasets from Kaggle (147) and the UCI Machine Learning Repository (53) and developed ML classification models for each dataset, through both classical implementation and hyperparameter tuning. We then fitted multiple regression models to explore the impact of dataset features on ML performance. Kurtosis has a statistically significant negative effect on the accuracy of the three non-tree-based ML algorithms, support vector machine (SVM), logistic regression (LR) and K-nearest neighbour (KNN), in their classical implementation with both uniform and weighted aggregation. We observed similar findings in most cases for the hyperparameter-tuned implementations, except for SVM with weighted aggregation. Meta-level and statistical features show almost no statistically significant effect on the accuracy of the two tree-based ML algorithms (decision tree and random forest), except for the hyperparameter-tuned implementation with weighted aggregation. When we excluded some datasets, based on their imbalance statistics and on one attribute contributing substantially more than the others to classification performance, we found a significant effect of the meta-level ratio feature and the statistical mean and standard deviation features on SVM, LR and KNN accuracy in many cases. Our findings open a new research direction in understanding how dataset characteristics affect ML performance and will help researchers select ML algorithms that are more likely to achieve high accuracy on a given dataset.
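
The feature extraction described in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and several choices are assumptions: the abstract does not say which scaling was used (min-max scaling to [0, 1] is assumed here), how kurtosis was defined (excess kurtosis is assumed), or where the weights for the weighted aggregation come from (the hypothetical `weights` argument is left to the caller). The function name `dataset_features` is also illustrative.

```python
import numpy as np


def dataset_features(X, y, weights=None):
    """Meta-level and aggregated statistical features for one tabular dataset.

    X: 2-D array of numeric attributes (rows are instances).
    y: binary class labels (1 = positive, 0 = negative).
    weights: optional per-attribute weights (hypothetical); None gives a
             uniform average across attributes.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Assumed scaling: min-max each attribute to [0, 1].
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - col_min) / col_range

    # Per-attribute statistics on the scaled data (population moments).
    mean = Z.mean(axis=0)
    std = Z.std(axis=0)
    safe_std = np.where(std == 0, 1.0, std)
    standardized = (Z - mean) / safe_std
    skewness = (standardized ** 3).mean(axis=0)
    kurtosis = (standardized ** 4).mean(axis=0) - 3.0  # assumed: excess kurtosis

    def aggregate(values):
        # Uniform average when weights is None, weighted average otherwise.
        return float(np.average(values, weights=weights))

    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())

    return {
        "size": X.shape[0],
        "n_attributes": X.shape[1],
        "ratio": n_pos / n_neg if n_neg else float("inf"),
        "mean": aggregate(mean),
        "std": aggregate(std),
        "skewness": aggregate(skewness),
        "kurtosis": aggregate(kurtosis),
    }
```

Calling `dataset_features(X, y)` once per dataset would give one row of predictors; a regression of model accuracy on those rows is the kind of analysis the abstract describes.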
