Curation at Scale with EPITOME: Extraction Pipeline for Immunological Texts and Open-Source Multimodal Enquiry



Abstract

The Immune Epitope Database (IEDB, iedb.org) has manually curated epitope data from over 26,000 publications across two decades. With PubMed adding ~5,000 articles daily, manual curation faces scalability challenges. Because scientific papers present data in multiple modalities, we set out to build an open-source vision-language model (VLM)-based tool that human curators can use to speed up and automate biological data curation. Here we present a multimodal document ingestion and question-answering (QnA) pipeline that combines traditional optical character recognition (OCR) and text matching with VLM capabilities. The system, which we call EPITOME, processes documents in three stages: regex-based identification of epitopes and MHC molecules, extraction of visual elements from PDFs, and contextual indexing that links peptide sequences, MHC molecules, and assays to their locations across text, tables, and figures. The resulting index supplies context for subsequent VLM QnA. Our preliminary results show promising zero-shot performance from open-source VLMs, suggesting that EPITOME can accelerate biocuration through a curator-in-the-loop process; our evaluation also identifies strategic points where curator intervention can improve overall system accuracy.
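To make the first and third stages concrete, the following is a minimal sketch of regex-based entity matching followed by a location index. It is not the EPITOME implementation: the two patterns (8 to 25 one-letter amino acid codes for peptides, and `HLA-` style allele names for MHC molecules), the `build_index` function, and the `(block_id, kind, text)` input shape are all assumptions chosen for illustration, and the sample blocks are made up.

```python
import re
from collections import defaultdict

# Assumed pattern: a run of 8-25 uppercase one-letter amino acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PEPTIDE_RE = re.compile(rf"\b[{AMINO_ACIDS}]{{8,25}}\b")

# Assumed pattern: human MHC alleles such as HLA-A*02:01 or HLA-DRB1*15:01.
MHC_RE = re.compile(r"\bHLA-[A-Z]+[0-9]*\*\d{2,3}(?::\d{2,3})*\b")


def build_index(blocks):
    """Map each matched entity to the (block_id, kind) locations where it occurs.

    `blocks` is an iterable of (block_id, kind, text) tuples, where `kind`
    is a hypothetical label such as "text", "table", or "figure".
    """
    index = defaultdict(list)
    for block_id, kind, text in blocks:
        for pattern, label in ((PEPTIDE_RE, "peptide"), (MHC_RE, "mhc")):
            for match in pattern.finditer(text):
                index[(label, match.group())].append((block_id, kind))
    return dict(index)


# Illustrative input only; these blocks are not taken from any curated paper.
blocks = [
    ("p1", "text", "The peptide GILGFVFTL binds HLA-A*02:01 with high affinity."),
    ("t1", "table", "GILGFVFTL | HLA-A*02:01 | ELISPOT"),
    ("f2", "figure", "Tetramer staining for NLVPMVATV"),
]
index = build_index(blocks)
```

An index of this shape lets a later question-answering step retrieve every block that mentions a given peptide or allele. A pattern this simple will also match ordinary all-caps words made up only of amino acid letters, which is one reason a curator-in-the-loop check remains useful.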
