Abstract
BACKGROUND: Hepatocellular carcinoma (HCC) remains a leading cause of cancer-related mortality worldwide, largely because early diagnosis is difficult and conventional biomarkers have limited sensitivity. Reliable molecular tools for early detection, prognostic stratification, and individualized prediction of treatment response are therefore urgently needed.

METHODS: This retrospective study analyzed publicly available gene expression datasets. Candidate biomarkers were identified from the GSE14520 cohort using a multistep screening workflow that integrated differential expression analysis, diagnostic performance, and prognostic relevance. A 10-gene diagnostic model was constructed using least absolute shrinkage and selection operator (LASSO) logistic regression and subsequently validated in multiple independent cohorts. Survival outcomes were evaluated using Kaplan-Meier analysis, and responses to sorafenib and transarterial chemoembolization (TACE) were assessed using receiver operating characteristic (ROC) analysis.

RESULTS: A 10-gene signature (TOP2A, CDK1, CYP3A4, MASP2, EPHX2, HAO1, RACGAP1, GLYAT, ADH1B, and CYP4A11) was established. The model demonstrated robust internal performance and consistent accuracy across external validation cohorts (area under the curve [AUC], >0.9). The signature identified early-stage HCC and distinguished malignancy from cirrhosis. High risk scores were significantly associated with poorer overall survival and recurrence-free survival (p<0.05). The risk score also predicted treatment sensitivity: higher risk scores were associated with better outcomes with sorafenib (AUC, 0.791), whereas lower risk scores correlated with an improved response to TACE (AUC, 0.768).

CONCLUSION: Our gene expression-based machine learning model provides a robust tool for HCC diagnosis, prognosis, and treatment response prediction, and has potential as a decision-support system for personalized clinical care.
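The abstract names LASSO logistic regression and ROC analysis but does not describe how they were implemented. The sketch below illustrates the general idea only, and is not the authors' pipeline: it fits an L1-penalized logistic regression to synthetic data and scores the resulting risk score by AUC. The proximal gradient solver, the penalty strength `lam`, the step size, the simulated features, and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink toward zero, clip at zero.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def risk_score(row, coef, intercept):
    # Linear predictor; larger values mean higher predicted risk.
    return intercept + sum(c * v for c, v in zip(coef, row))


def fit_lasso_logistic(X, y, lam=0.1, step=0.1, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    coef, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residual = predicted probability minus observed label.
        resid = [sigmoid(risk_score(row, coef, intercept)) - lab
                 for row, lab in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n
                for j in range(p)]
        intercept -= step * sum(resid) / n  # the intercept is not penalized
        coef = [soft_threshold(c - step * g, step * lam)
                for c, g in zip(coef, grad)]
    return coef, intercept


def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n=200, seed=0):
    # Purely simulated "expression" values, not real data:
    # feature 0 is higher in class 1, feature 1 is lower, features 2-4 are noise.
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = 1.5 if label else -1.5
        row = [rng.gauss(shift, 1.0), rng.gauss(-shift, 1.0)]
        row += [rng.gauss(0.0, 1.0) for _ in range(3)]
        X.append(row)
        y.append(label)
    return X, y


X, y = make_synthetic()
coef, intercept = fit_lasso_logistic(X, y)
scores = [risk_score(row, coef, intercept) for row in X]
auc = roc_auc(scores, y)
print("coefficients:", [round(c, 3) for c in coef])
print("training AUC:", round(auc, 3))
```

The L1 penalty pushes coefficients of uninformative features toward exactly zero, which is how a LASSO fit can reduce a large candidate list to a short signature. The AUC here is computed on the training data, so it is optimistic; an external-validation setup like the one described above would score held-out cohorts instead.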