Abstract
BACKGROUND: Recent technological advances have enabled the simultaneous collection of multi-omics data, i.e., different types or modalities of molecular data. Integrative predictive modeling of such data is particularly challenging. Ideally, data from the different modalities are measured in the same individuals, allowing for early or intermediate integrative techniques. However, these techniques are often not applicable when patient data only partially overlap across modalities, because they then require either reducing the datasets or imputing missing values. Additionally, the diversity of data modalities may necessitate modality-specific statistical methods rather than a single method applied across all modalities. Late integration modeling approaches analyze each data modality separately to obtain modality-specific predictions. These predictions are then aggregated, either by training a machine learning (ML) meta-model on them or by computing their weighted mean.

RESULTS: We introduce the R package fuseMLR for late integration prediction modeling. The package is user-friendly, enables variable selection and the application of different ML algorithms for each modality, and automatically performs aggregation once modality-specific training is completed. We illustrate the package’s functionality in a small simulation study and with two publicly available multi-omics datasets from The Cancer Genome Atlas.

CONCLUSION: The package fuseMLR enables predictive modeling with late integration in a systematic, structured, and reproducible way.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-025-06248-4.
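
The weighted-mean variant of late integration described in the background can be illustrated with a minimal sketch. It is written in plain Python for illustration only and does not use or reproduce the fuseMLR interface. The function name `weighted_mean_fusion`, the modality names, the patient identifiers, the prediction values, and the weights are all hypothetical. The sketch also assumes that, for a patient missing from a modality, the weights of the remaining modalities are renormalized, which is one possible way to handle partially overlapping data.

```python
from typing import Dict, Mapping


def weighted_mean_fusion(
    predictions: Mapping[str, Mapping[str, float]],
    weights: Mapping[str, float],
) -> Dict[str, float]:
    """Aggregate modality-specific predictions by a weighted mean.

    predictions maps modality name -> {patient id -> predicted probability}.
    A patient missing from a modality is skipped for that modality, and the
    weights of the remaining modalities are renormalized (an assumption of
    this sketch, not a documented behavior of any package).
    """
    # Union of all patients seen in at least one modality.
    patients = sorted({pid for preds in predictions.values() for pid in preds})
    fused: Dict[str, float] = {}
    for pid in patients:
        # Modalities that actually produced a prediction for this patient.
        available = [m for m, preds in predictions.items() if pid in preds]
        total_weight = sum(weights[m] for m in available)
        if total_weight <= 0:
            raise ValueError(f"no positive weight available for patient {pid!r}")
        fused[pid] = (
            sum(weights[m] * predictions[m][pid] for m in available) / total_weight
        )
    return fused


# Hypothetical modality-specific predictions with partial patient overlap.
modality_predictions = {
    "gene_expression": {"p1": 0.8, "p2": 0.25},
    "methylation": {"p1": 0.5, "p3": 0.75},
}
# Hypothetical weights, e.g. derived from each modality's validation performance.
modality_weights = {"gene_expression": 3.0, "methylation": 1.0}

fused = weighted_mean_fusion(modality_predictions, modality_weights)
```

In this sketch, patient `p1` is predicted from both modalities, whereas `p2` and `p3` each rely on the single modality in which they were measured, so no patient is dropped and no value is imputed. Replacing the weighted mean with a model trained on the modality-specific predictions would give the meta-model variant.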