Abstract
Objective: This study aimed to identify key genes associated with diabetic retinopathy (DR) by applying bioinformatics and machine learning techniques to publicly available transcriptomic datasets. We further evaluated their diagnostic performance and explored their potential biological functions and upstream regulatory mechanisms, providing a theoretical basis for the early diagnosis and molecular-targeted therapy of DR.
Methods: The DR-related transcriptomic datasets GSE94019 and GSE60436 were obtained from the Gene Expression Omnibus (GEO) database, with GSE94019 serving as the training set and GSE60436 as the validation set. The data were normalized and analyzed for differential expression. Feature genes were selected using Least Absolute Shrinkage and Selection Operator (LASSO) regression and the Support Vector Machine-Recursive Feature Elimination (SVM-RFE) algorithm, and genes selected by both methods were taken as key candidates. Diagnostic performance was evaluated with receiver operating characteristic (ROC) curves generated using the R package pROC. Functional enrichment analysis, including Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses, was performed on differentially expressed genes (DEGs) associated with the key gene. Potential upstream miRNAs and lncRNAs were predicted using the miRanda, miRDB, TargetScan, and spongeScan databases, and a lncRNA-miRNA-mRNA regulatory network was constructed.
Results: A total of 790 DEGs were identified, including 370 upregulated and 419 downregulated genes. Intersecting the LASSO and SVM-RFE selections identified Collagen Type VI Alpha 2 Chain (COL6A2) and LINC01247 as key genes. COL6A2 was significantly upregulated in the DR group, and ROC analysis showed high diagnostic accuracy, with area under the curve (AUC) values of 1.00 in the training set and 0.89 in the validation set. LINC01247, in contrast, was significantly downregulated, but its AUC fell from 1.00 in the training set to 0.52 in the validation set, indicating limited diagnostic value; it was therefore excluded from further analysis. Functional enrichment centered on COL6A2 suggested that its associated DEGs were involved in aberrant extracellular matrix (ECM) organization, cell adhesion, angiogenesis, and inflammatory responses. Moreover, regulatory network analysis indicated that multiple lncRNAs (e.g., PABPC1L2B-AS1 and RP11-223P11.3) may indirectly regulate COL6A2 expression by competitively binding hsa-miR-762 and hsa-miR-29a-3p, forming a potential ceRNA regulatory axis.
Conclusion: This study identifies COL6A2 as a key gene in DR, characterized by significant upregulation in DR tissues and close involvement in ECM remodeling, cell adhesion, and angiogenesis. These findings provide a candidate molecular target and theoretical insights for elucidating the molecular mechanisms of DR and for improving early diagnostic strategies.
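The two selection and evaluation steps described above, keeping only genes chosen by both algorithms and then scoring each candidate by AUC, can be illustrated with a minimal pure-Python sketch. This is not the study's pipeline, which used R and pROC. The expression values, the placeholder names `GENE_A`, `GENE_B`, and `GENE_C`, and the helper `auc` are all hypothetical and serve only to show the arithmetic. The sketch relies on the standard interpretation of AUC as the probability that a randomly chosen case scores higher than a randomly chosen control.

```python
def auc(case_scores, control_scores):
    """Rank-based AUC: fraction of case/control pairs in which the case
    has the higher score, with ties counted as half a win."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical feature sets returned by the two selection algorithms.
lasso_selected = {"COL6A2", "LINC01247", "GENE_A", "GENE_B"}
svm_rfe_selected = {"COL6A2", "LINC01247", "GENE_C"}

# Key candidates are the genes selected by both methods.
key_genes = lasso_selected & svm_rfe_selected
print(sorted(key_genes))  # ['COL6A2', 'LINC01247']

# Hypothetical expression values for one candidate gene (arbitrary units).
train_cases = [8.1, 7.9, 8.4, 8.0]
train_controls = [5.2, 5.0, 5.6, 5.4]
valid_cases = [6.0, 5.1, 7.0, 5.8]
valid_controls = [5.5, 6.2, 4.9, 5.0]

# Perfect separation in training gives an AUC of 1.0.
print(auc(train_cases, train_controls))  # 1.0

# Overlapping distributions in validation give a lower AUC.
print(auc(valid_cases, valid_controls))  # 0.75
```

A training AUC of 1.00 only shows that the two groups are perfectly separated in that one dataset, so the AUC in an independent validation set is the more informative figure, which is the reasoning behind retaining COL6A2 and excluding LINC01247.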