Abstract
This study uses multi-omics high-throughput sequencing data, including ATAC-seq and RNA-seq data from the TCGA, GTEx, and GEO databases, to construct predictive and prognostic models for lung adenocarcinoma (LUAD) and to identify potential biomarkers. We first obtained LUAD ATAC-seq data from TCGA and identified differential chromatin regions and genes through functional analysis. Differential peaks (DPs) potentially influencing LUAD progression were determined by comparing patients at different stages, and these DPs were annotated to the genome to obtain differential peak genes (DPGs). We then integrated RNA-seq data from GTEx and TCGA to identify differentially expressed genes (DEGs) at the mRNA level; intersecting the DEGs with the DPGs yielded 337 consensus genes (CGs). Using random forest and LASSO algorithms, we screened the CGs and constructed a predictive model comprising nine predictive-related genes (Pre-RGs), which was validated with an external dataset (GSE140343). In addition, Kaplan-Meier and Cox analyses combined with LASSO identified five prognostic-related genes (Pro-RGs), which were used to establish a prognostic Cox proportional hazards model that was also validated with GSE140343. Analysis of a single-cell dataset examined the expression of the Pre-RGs and Pro-RGs across immune cell types, and a further meta-analysis in the LCE database verified their expression differences and prognostic significance. Furthermore, we sequenced cell-free RNAs (cfRNAs) from 50 plasma samples (25 early-stage lung cancer cases and 25 benign pulmonary disease cases) to validate early cancer detection. Overall, we identified a signature of 14 genes (S100A8, GPM6A, FEZ1, OTX1, DNAH14, XDH, XPR1, SLC39A11, OCIAD2, TNS4, RHOV, YWHAZ, CLEC12A, and CASZ1) that show potential as drug targets and as biomarkers for predicting LUAD development, prognosis, and early detection. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12885-025-14943-x.
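As a rough illustration of two steps named in the abstract, the sketch below intersects a DEG list with a DPG list to obtain consensus genes and then computes the linear predictor of a Cox proportional hazards model for one sample. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `consensus_genes` and `cox_risk_score`, the placeholder genes `GENE_A` and `GENE_B`, and all coefficients and expression values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import exp


def consensus_genes(degs, dpgs):
    """Return the genes present in both lists, sorted alphabetically."""
    return sorted(set(degs) & set(dpgs))


def cox_risk_score(expression, coefficients):
    """Cox linear predictor: sum of coefficient * expression over model genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


# Toy inputs (hypothetical): two real gene symbols from the abstract plus
# placeholder names that stand in for genes found in only one list.
degs = ["S100A8", "GPM6A", "GENE_A"]
dpgs = ["GPM6A", "S100A8", "GENE_B"]
cgs = consensus_genes(degs, dpgs)

# Made-up coefficients and expression values, NOT fitted or measured values.
coefs = {"S100A8": 0.5, "GPM6A": -0.25}
sample = {"S100A8": 3.0, "GPM6A": 4.0}

score = cox_risk_score(sample, coefs)  # 0.5 * 3.0 + (-0.25) * 4.0
relative_hazard = exp(score)           # hazard relative to a score of zero

print(cgs, score, relative_hazard)
```

In a real analysis the coefficients would come from a fitted, LASSO-penalized Cox model, and samples would typically be split into high-risk and low-risk groups by a threshold on the score, such as the cohort median.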