Using decision trees to understand structure in missing data

利用决策树理解缺失数据的结构

阅读:1

Abstract

OBJECTIVES: Demonstrate the application of decision trees--classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs)--to understand structure in missing data. SETTING: Data taken from employees at 3 different industrial sites in Australia. PARTICIPANTS: 7915 observations were included. MATERIALS AND METHODS: The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the 'rpart' and 'gbm' packages for CART and BRT analyses, respectively, from the statistical software 'R'. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced. RESULTS: CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site in which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured as compared to structured missingness. DISCUSSION: Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data, and selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers. CONCLUSIONS: Researchers are encouraged to use CART and BRT models to explore and understand missing data.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。