Abstract
Confounding factors in olfactory data, such as high humidity, substantially affect sensor outputs, masking subtle volatile organic compound (VOC) patterns and hindering the development of generalizable machine learning models. Traditional representation learning methods often require large datasets to mitigate confounder-induced variance, a resource unavailable in specialized sensor applications with limited data. This study presents Confounder-Invariant Representation Learning (CIRL), a method designed to mitigate confounding influences in data-scarce settings by leveraging explicit confounder information, such as relative humidity. CIRL reduces confounder effects in the learned representations, yielding cleaner representations and more robust models. CIRL was integrated with standard autoencoder models and applied to three breath aroma datasets (acetone, ketosis, and peppermint-oil breath), all affected by high humidity. Evaluated within the same framework, CIRL improved generalization, raising classification accuracy by 10-15% across all three datasets. These results demonstrate CIRL's potential to advance reliable artificial olfaction for applications such as breath-based diagnostics in challenging real-world conditions.