An active learning pipeline to automatically identify candidate terms for a CDSS ontology: measures, experiments, and performance



Abstract

OBJECTIVE: To explore new strategies that make the document selection process more transparent, reproducible, and effective for active learning. The ultimate goal is to leverage active learning to identify keyphrases that facilitate ontology development and construction, streamline the process, and support long-term maintenance.

METHODS: The active learning pipeline used a BiLSTM-CRF model and over 2,900 abstracts retrieved from PubMed relevant to clinical decision support systems (CDSS). Model training started with synthetically labeled abstracts; different strategies were then used to select abstracts annotated by domain experts (gold standards). Random sampling served as the baseline. Recall and F-beta (beta = 1, 5, and 10) scores were used to compare the performance of the active learning pipeline under the different strategies.

RESULTS: We tested four novel document-level uncertainty aggregation strategies (KPSum, KPAvg, DOCSum, and DOCAvg) that operate over standard token-level uncertainty scores such as Maximum Token Probability (MTP), Token Entropy (TE), and Margin. All strategies showed significant improvement in recall and F1 in early active learning cycles (θ(0) to θ(2)). Systematic evaluations showed that KPSum (actual order) yielded consistent improvement in both recall and F1 and outperformed random sampling. Document order (actual versus reverse) did not play a critical role in model learning and performance across strategies in our datasets, although for some strategies the actual order was slightly more effective. The weighted F-beta scores (beta = 5 and 10) provided results complementary to raw recall and F1 (beta = 1).

CONCLUSION: While prior work on uncertainty sampling typically focuses on token-level uncertainty metrics within generic NER tasks, our work advances this line of research by introducing a higher-level abstraction: document-level uncertainty aggregation. Within a human-in-the-loop active learning pipeline, it can effectively prioritize high-impact documents, improve early-cycle recall, and reduce annotation effort. Our results show promise in automating part of ontology construction and maintenance, i.e., monitoring and screening new publications to identify candidate keyphrases. However, future work must improve model performance before the pipeline is usable in real-world operations.
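The abstract names the four document-level aggregation strategies but does not spell out their formulas. A minimal sketch of the idea, assuming KPSum/KPAvg sum or average token entropy over candidate-keyphrase tokens while DOCSum/DOCAvg do so over all tokens (an assumption; the exact definitions are in the full paper), might look like:

```python
import math

def token_entropy(dist):
    """Entropy of one token's predicted label distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def doc_uncertainty(token_dists, kp_mask, strategy):
    """Aggregate token-level entropies into a single document score.

    kp_mask marks tokens inside candidate keyphrases. The four strategy
    formulas below are assumptions sketched from the strategy names.
    """
    ents = [token_entropy(d) for d in token_dists]
    kp_ents = [e for e, m in zip(ents, kp_mask) if m]
    if strategy == "DOCSum":
        return sum(ents)
    if strategy == "DOCAvg":
        return sum(ents) / len(ents)
    if strategy == "KPSum":
        return sum(kp_ents)
    if strategy == "KPAvg":
        return sum(kp_ents) / len(kp_ents) if kp_ents else 0.0
    raise ValueError(f"unknown strategy: {strategy}")

def select_for_annotation(docs, strategy, k):
    """Rank unlabeled documents by aggregated uncertainty; the top-k are
    routed to domain experts for gold-standard annotation."""
    scored = sorted(
        docs,
        key=lambda d: doc_uncertainty(d["dists"], d["kp"], strategy),
        reverse=True,
    )
    return [d["id"] for d in scored[:k]]
```

A document whose tokens have near-uniform label distributions scores higher than one with confident predictions, so it is prioritized for expert annotation in the next cycle.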
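The weighted F-beta scores used for evaluation follow the standard definition, in which beta > 1 weights recall more heavily than precision; this is why beta = 5 and 10 complement raw recall. A small helper (name and signature are illustrative, not from the paper):

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score: beta > 1 emphasizes recall over precision.

    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with precision 0.5 and recall 0.8, F1 ≈ 0.615 while F10 ≈ 0.795, i.e., the beta = 10 score tracks recall closely, which suits screening tasks where missing a candidate keyphrase is costlier than a false positive.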
