Abstract
Sepsis research has long been constrained by scarce labeled data and by task-specific models that rely primarily on tabular inputs and overlook the information contained in clinical text. To address these limitations, we propose the Sepsis Data Representation Model (SepsisDRM), an embedding model that jointly processes tabular and textual data to produce comprehensive patient representations. Trained on data from 19,526 sepsis patients, SepsisDRM generalizes across diverse sepsis-related tasks without task-specific tuning. It stratifies patients into four clinically interpretable phenotypes and predicts 28-day outcomes with AUCs of 0.92, 0.94, and 0.78 on retrospective, prospective, and external datasets, respectively. As the first embedding model developed specifically for sepsis, SepsisDRM establishes a new paradigm for sepsis research and offers a promising approach for other fields that require integrating tabular and textual data.