Using Large Language Model to Optimize Protein Purification: Insights from Protein Structure Literature Associated with Protein Data Bank

Abstract

Obtaining pure and homogeneous protein samples is vital for protein biology studies, yet optimizing protein expression and purification methods can be time-consuming because of variations in factors such as expression conditions, buffer components, and fusion tags. With over 81 000 Protein Data Bank (PDB)-associated articles as of October 2024, manual extraction of relevant methods is impractical. To streamline this process, an automated tool is developed that incorporates a large language model (LLM) to extract and classify key data from these articles. The information extraction accuracy is enhanced by a 2-step-LLM and a 3-step-prompt. The key findings include: 1) Tris buffer is used in 49.2% of cases, followed by 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES) and phosphate buffers. 2) Polyhistidine tags dominate at 82.5%, followed by glutathione S-transferase (GST) and maltose-binding protein (MBP) tags. 3) E. coli expression is done at 16-20 °C, with induction periods favoring 12-16 h (69.0%) over 3-6 h (14.3%). The statistical analyses highlight the correlation between protein properties and purification strategies. The tool is validated through two case studies: method bias in membrane protein purification, and crosslinker/detergent preferences in cryo-electron microscopy sample preparation. These findings provide a valuable resource for designing protein expression and purification experiments.
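
As a rough illustration of how usage shares like the buffer and tag percentages above could be tallied once one record per article has been extracted, the following minimal sketch counts category frequencies. The record fields (`buffer`, `tag`), the helper `category_shares`, and the toy records are all hypothetical and are not taken from the study; the tool's actual data schema is not described in this abstract.

```python
from collections import Counter

def category_shares(records, field):
    """Return {category: percentage} for one field, skipping missing values.

    `records` is assumed to be a list of dicts, one per article
    (a hypothetical schema chosen for this sketch).
    """
    values = [r[field] for r in records if r.get(field)]
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    # most_common() orders categories from most to least frequent
    return {cat: round(100.0 * n / total, 1) for cat, n in counts.most_common()}

# Toy records for illustration only (not data from the study).
records = [
    {"buffer": "Tris", "tag": "His"},
    {"buffer": "HEPES", "tag": "His"},
    {"buffer": "Tris", "tag": "GST"},
    {"buffer": None, "tag": "His"},  # buffer not reported in this article
]

buffer_shares = category_shares(records, "buffer")
tag_shares = category_shares(records, "tag")
```

Records with a missing value are left out of the denominator here, so each percentage is a share of the articles that report that field; whether the study normalizes this way is not stated in the abstract.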
