Abstract
INTRODUCTION: From a single genome, the collection of an organism's DNA molecules, life can produce different cell types as well as the mechanisms that allow bacteria to adapt to environmental changes. Although regulation of gene expression can occur at different levels, regulation of transcription initiation, the start of copying DNA into RNA, is the most studied level in bacteria. The collection of regulators and their regulated elements defines transcriptional regulatory networks (TRNs), whose study has advanced important research areas, such as antimicrobial resistance. Analyzing and understanding these networks depends on a few carefully, manually curated databases. The traditional way to reconstruct these networks is manual curation of the literature, which is accurate but also demanding and time-consuming. These limitations have left bacterial TRNs scarce and incomplete. METHODS: Here, we present a novel ensemble approach that uses two GPT-based foundation models (LLaMA-3 and GPT-4o mini) to reconstruct TRNs from the literature. We applied supervised fine-tuning with sentences from the Escherichia coli literature to train the models to predict the type of regulatory effect between a transcription factor and a regulated element (gene or operon). To evaluate how well the approach reconstructs a curated TRN, we used 264 full-text articles on Salmonella Typhimurium, a pathogen of clinical interest. RESULTS: On the test data, both models achieved high performance (F1-score > 0.87, Matthews correlation coefficient > 0.82). For the curated TRN reconstruction, the ensemble approach based on agreement between the two models correctly reconstructed 80% of the TRN (recall: 0.80, F1-score: 0.64). We then applied the approach to reconstruct a large Salmonella TRN from the literature available at the time on transcriptional regulation in this bacterium (2,278 articles).
We characterized this network using network metrics and over-representation analyses, and compared it with existing biological knowledge. DISCUSSION: Our approach outperformed prior work on predicting the effect of the interaction. The analysis of the TRN reconstructed from the 2,278 articles showed that our approach is effective for reconstructing TRNs of diverse bacteria, as the network aligns with biological knowledge. Thus, our work may support the study of bacteria of biological and clinical interest, especially those without a reconstructed TRN.
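
As an illustration of the agreement-based ensemble and the recall and F1 evaluation mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes that each model's output can be represented as a mapping from a (transcription factor, regulated element) pair to a predicted effect label, and that "agreement" means keeping only pairs for which both models predict the same label. The function names, the label strings, and the toy entries (`TF1`, `geneA`, and so on) are hypothetical placeholders, not data from the study.

```python
def agreement_ensemble(preds_a, preds_b):
    """Keep only interactions for which both models predict the same effect.

    Assumption: this is one plausible reading of "agreement of models".
    """
    return {
        pair: effect
        for pair, effect in preds_a.items()
        if preds_b.get(pair) == effect
    }


def precision_recall_f1(predicted, curated):
    """Score predicted interactions against a curated reference network."""
    # A true positive is a predicted pair whose effect matches the curated one.
    true_positives = sum(
        1 for pair, effect in predicted.items() if curated.get(pair) == effect
    )
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical toy predictions from two models (placeholder names and labels).
model_a = {
    ("TF1", "geneA"): "activator",
    ("TF1", "geneB"): "repressor",
    ("TF2", "geneC"): "activator",
}
model_b = {
    ("TF1", "geneA"): "activator",
    ("TF1", "geneB"): "activator",  # disagrees with model_a, so it is dropped
    ("TF2", "geneC"): "activator",
}
# Hypothetical curated reference network.
curated_trn = {
    ("TF1", "geneA"): "activator",
    ("TF1", "geneB"): "repressor",
}

ensemble = agreement_ensemble(model_a, model_b)
precision, recall, f1 = precision_recall_f1(ensemble, curated_trn)
print(ensemble)
print(precision, recall, f1)
```

In this sketch, requiring agreement discards any interaction on which the two models disagree. Recall is measured against the curated network, while precision is measured against everything the ensemble predicts.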