Abstract
BACKGROUND: This study establishes a computational framework for predictive style modeling in tobacco formulation design, addressing the disconnect between empirical approaches and the complexity of blended systems. Herein, "style" refers to the characteristic sensory profiles (e.g., aroma, taste, and physiological sensations) intrinsically linked to cultivation regions, which arise from the unique combination of local environmental factors such as climate and soil composition. A convolutional neural network (CNN) framework was developed to integrate conventional chemical indicators with thermogravimetric analysis-derived features from 434 geographically authenticated tobacco leaf samples. Through regionally constrained Monte Carlo sampling of composition ratios, 304,800 formulation data sets simulating real-world blending constraints were generated to enable robust model training.
RESULTS: The leaf-centric CNN achieved a region-style classification accuracy of 99.54% under fivefold cross-validation, outperforming conventional machine learning models and revealing that thermal and chemical features are complementary in regional style characterization. However, direct application to blended formulations revealed a critical limitation: only 50.91% of blended formulations maintained stylistic consistency with their primary source leaves, underscoring the inadequacy of a single-leaf model for blended systems. To overcome this, a unified CNN framework was trained on a consolidated multi-source data set encompassing both raw leaves and engineered blends, leveraging their shared feature space. This hybrid learning model achieved both high regional style identification accuracy (90.09%) and high leaf-to-blend style consistency (87.90%). Mechanistic analysis identified a nonlinear threshold effect: primary source leaves maintained 99.91% stylistic dominance when they exceeded 90% of the composition, decreasing to 67.90% at 30% of the composition. The risk of formulation style deviation became significant when the compositional gap between the principal and secondary source leaves narrowed below 10%.
CONCLUSIONS: Building on these insights, a probabilistic style modulation strategy was proposed and validated through case applications, transforming theoretical discoveries into actionable design strategies. The strategy sets region ratio constraints from the threshold-defined boundaries, providing a data-driven framework for systematically achieving a target formulation style. This framework advances tobacco engineering from empirical practice toward predictive digital design and provides a template for agricultural product manufacturing systems facing similar formulation challenges.
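As an illustration only, the regionally constrained Monte Carlo sampling of composition ratios and the 10% gap rule described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the region labels, the function names (`sample_blend`, `primary_gap`, `at_risk`), the rejection-sampling scheme, and the default constraint values `min_primary` and `min_share` are all hypothetical. Only the 10% gap threshold is taken from the text.

```python
import random


def sample_blend(regions, n_components=3, min_primary=0.5,
                 min_share=0.05, rng=None, max_tries=10_000):
    """Draw one hypothetical blend as {region: ratio}, with ratios summing to 1.

    Assumed constraints (not from the source): the largest share is at least
    `min_primary` and every share is at least `min_share`.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        chosen = rng.sample(regions, n_components)
        # Normalised exponential draws give a uniform sample on the simplex.
        raw = [rng.gammavariate(1.0, 1.0) for _ in chosen]
        total = sum(raw)
        shares = [w / total for w in raw]
        # Rejection step: keep only blends that satisfy the constraints.
        if max(shares) >= min_primary and min(shares) >= min_share:
            return dict(zip(chosen, shares))
    raise RuntimeError("no blend satisfied the constraints")


def primary_gap(blend):
    """Difference between the largest and second-largest shares."""
    top = sorted(blend.values(), reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)


def at_risk(blend, gap_threshold=0.10):
    """Flag blends whose principal/secondary gap is below the 10% threshold."""
    return primary_gap(blend) < gap_threshold


if __name__ == "__main__":
    rng = random.Random(42)
    regions = ["region_A", "region_B", "region_C", "region_D"]  # hypothetical
    blend = sample_blend(regions, rng=rng)
    print(blend, "at risk:", at_risk(blend))
```

Rejection sampling is used here because it keeps the constraint logic explicit. A study generating hundreds of thousands of formulations would likely need a more efficient constrained sampler.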