A review of machine learning methods for imbalanced data challenges in chemistry

化学领域不平衡数据挑战的机器学习方法综述

阅读:2

Abstract

Imbalanced data, where certain classes are significantly underrepresented in a dataset, is a widespread machine learning (ML) challenge across various fields of chemistry, yet it remains inadequately addressed. This data imbalance can lead to biased ML or deep learning (DL) models, which fail to accurately predict the underrepresented classes, thus limiting the robustness and applicability of these models. With the rapid advancement of ML and DL algorithms, several promising solutions to this issue have emerged, prompting the need for a comprehensive review of current methodologies. In this review, we examine the prominent ML approaches used to tackle the imbalanced data challenge in different areas of chemistry, including resampling techniques, data augmentation techniques, algorithmic approaches, and feature engineering strategies. Each of these methods is evaluated in the context of its application across various aspects of chemistry, such as drug discovery, materials science, cheminformatics, and catalysis. We also explore future directions for overcoming the imbalanced data challenge and emphasize data augmentation via physical models, large language models (LLMs), and advanced mathematics. The benefit of balanced data in new material design and production and the persistent challenges are discussed. Overall, this review aims to elucidate the prevalent ML techniques applied to mitigate the impacts of imbalanced data within the field of chemistry and offer insights into future directions for research and application.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。