Abstract
To accelerate early-stage drug development for diseases that lack established target databases, and to keep knowledge current in well-studied disease domains, this paper introduces TARGETFLOW, an automated literature-mining pipeline. The workflow automatically retrieves literature, downloads the relevant abstracts, and compiles them into a database. After selective text cleaning and data preprocessing, it uses large language models (LLMs) to screen the literature, then applies code-based whitespace tokenization and rule-based filtering to extract high-potential therapeutic targets for the specified disease. To validate the pipeline, three hypotheses were formulated: (1) an effective pipeline should identify high-potential therapeutic targets for the given disease; (2) for diseases with established target databases, the pipeline should detect novel and emerging targets not yet included in those databases; and (3) the pipeline should also be applicable to rare or emerging diseases that lack mature target databases. Rheumatoid arthritis (RA), a common disease, and idiopathic pulmonary fibrosis (IPF), a rare disease, were selected as case studies. The results demonstrated the method's reliability (high-potential target validation rate: 56%), novelty (new-target validation pass rate: 100%), and generalizability (IPF target literature support rate: 88.9%).
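The tokenization and rule-based filtering stages can be illustrated with a minimal Python sketch. It is not the TARGETFLOW implementation: the exclusion list, the symbol-shape rule in `looks_like_target`, and the `min_abstracts` threshold are all hypothetical choices made for illustration, and the sketch assumes the abstracts have already passed LLM screening.

```python
import string
from collections import Counter

# Hypothetical exclusion list: abbreviations that resemble gene symbols
# but are not therapeutic targets. The real pipeline's rules may differ.
EXCLUDED = {"RA", "IPF", "DNA", "RNA", "LLM"}


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to tokens."""
    return text.split()


def looks_like_target(token):
    """Illustrative rule (an assumption): 2-10 characters, starting with a
    letter, made only of uppercase letters, digits, and hyphens."""
    if not (2 <= len(token) <= 10) or token in EXCLUDED:
        return False
    if not token[0].isalpha():
        return False
    return all(c.isupper() or c.isdigit() or c == "-" for c in token)


def rank_candidates(abstracts, min_abstracts=2):
    """Count how many abstracts mention each candidate symbol and keep
    those mentioned in at least `min_abstracts` abstracts."""
    doc_freq = Counter()
    for abstract in abstracts:
        # Use a set so each abstract contributes at most once per token.
        tokens = {t.strip(string.punctuation) for t in whitespace_tokenize(abstract)}
        doc_freq.update(t for t in tokens if looks_like_target(t))
    return [(t, n) for t, n in doc_freq.most_common() if n >= min_abstracts]
```

Counting by abstract rather than by raw occurrence keeps a single paper that repeats one symbol many times from dominating the ranking. This is a design choice of the sketch, not a documented property of the pipeline.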