Abstract
In chemical reaction processes, yield prediction is frequently hampered by multi-variable coupling, strong nonlinearity, and the limited accuracy of traditional mechanistic models. To address these challenges, this study develops a data-driven prediction model that integrates a genetic algorithm (GA) with CatBoost. Based on 88 sets of experimental data, four variables were selected as model inputs: reactant ratio (n-butanol to trioxane), reaction temperature, reaction time, and catalyst concentration. The model outputs were the yield of polymethoxy dibutyl ether with a polymerization degree of 1 (BTPOM(1)) and the total yield of polymethoxy dibutyl ether with polymerization degrees of 1 to 8 (BTPOM(1-8)). A hybrid-coding genetic algorithm was used to optimize the CatBoost hyperparameters automatically. The results demonstrated that the GA-CatBoost model significantly outperformed GA-AdaBoost for both outputs. For BTPOM(1), it reduced the mean squared error (MSE) by 50.1%, the mean absolute error (MAE) by 40.6%, and the mean absolute percentage error (MAPE) by 17.8% relative to GA-AdaBoost. For BTPOM(1-8), the reductions were more pronounced: MSE decreased by 54.0%, MAE by 45.0%, and MAPE by 33.8%. The GA-CatBoost model also significantly outperformed three classical machine learning algorithms: Support Vector Regression (SVR), Random Forest (RF), and K-Nearest Neighbor (KNN). Feature importance analysis revealed that reaction time and reaction temperature are the key factors influencing BTPOM(n) yield. This research provides a feasible approach to accurate synthesis yield prediction and process optimization under small-sample conditions, and it is particularly valuable for early-stage laboratory research, where experimental data are often limited.
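To make the idea of a hybrid-coding genetic algorithm more concrete, the following is a minimal sketch in which each individual mixes integer-coded and real-coded genes. It is an illustration under stated assumptions, not the implementation used in this study: the gene names mirror common CatBoost hyperparameters, but the search ranges, the genetic operators, the population settings, and the `placeholder_loss` function are all hypothetical. In practice the loss would be something like a cross-validated error of a model trained with the candidate hyperparameters.

```python
import random

# Hypothetical search space. The names resemble common CatBoost
# hyperparameters, but the ranges are illustrative assumptions only.
SPACE = {
    "iterations": ("int", 100, 1000),
    "depth": ("int", 2, 8),
    "learning_rate": ("float", 0.01, 0.3),
}

def random_gene(kind, lo, hi, rng):
    # Integer genes are sampled as integers, real genes as floats.
    return rng.randint(lo, hi) if kind == "int" else rng.uniform(lo, hi)

def random_individual(rng):
    return {k: random_gene(*spec, rng) for k, spec in SPACE.items()}

def crossover(a, b, rng):
    # Uniform crossover: each gene is copied whole from one parent,
    # so integer genes remain integers.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}

def mutate(ind, rng, rate=0.2):
    out = dict(ind)
    for k, (kind, lo, hi) in SPACE.items():
        if rng.random() < rate:
            if kind == "int":
                # Integer genes move by a fixed integer step.
                step = rng.choice([-1, 1]) * max(1, (hi - lo) // 10)
                out[k] = min(hi, max(lo, out[k] + step))
            else:
                # Real genes receive Gaussian noise, clipped to the range.
                noisy = out[k] + rng.gauss(0, 0.1 * (hi - lo))
                out[k] = min(hi, max(lo, noisy))
    return out

def placeholder_loss(ind):
    # Stand-in for a validation error; a toy bowl-shaped function
    # whose minimum lies at (500, 6, 0.1).
    return (((ind["iterations"] - 500) / 900) ** 2
            + ((ind["depth"] - 6) / 6) ** 2
            + ((ind["learning_rate"] - 0.1) / 0.29) ** 2)

def run_ga(loss, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=loss)
        elite = pop[: pop_size // 4]  # best quarter survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pop = elite + children
    return min(pop, key=loss)

best = run_ga(placeholder_loss)
print(best)
```

Because the elite individuals are carried over unchanged, the best loss in the population cannot get worse from one generation to the next. Replacing `placeholder_loss` with a function that trains and scores a regressor would turn the sketch into an actual tuning loop, at the cost of one model fit per evaluation.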