Abstract
Modern virtual high-throughput screening (VHTS) pipelines are often suboptimally validated: no rigorous studies have conclusively demonstrated that every one of their steps reliably adds enrichment atop the baseline random hit rate. Moreover, the few benchmarking studies that are available focus primarily on the docking stage, which usually lies at or near the beginning of the pipeline, and even there, authors tend to use flawed data sets that artificially inflate performance metrics. Herein, we present an alternative approach to pipeline validation and data set generation that requires no additional experimental work or expenditure, yet offers vast amounts of negative data for use in VHTS pipeline validation. By randomizing ligands across published experimental structures and by generating structural isomers of known binders, practically unlimited amounts of negative data can be produced. Such sets of positive and negative data points match closely in key molecular properties and are well suited to pipeline validation. Once generated, these sets can be run through any proposed pipeline, with performance assessed at every step. We stress the importance of using negative data of adequate quality and quantity in validation studies to definitively and verifiably demonstrate the utility of a given tool or workflow. Our goal is to help distinguish tools and pipelines that truly accelerate hit discovery and lead optimization from those that promise to do so but do not, so that academia and industry can begin to tackle the many unaddressed medical needs of the 21st century.
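As an illustration of the first strategy, randomizing ligands across published experimental structures to obtain putative negatives, the following is a minimal sketch; the function name, the (protein, ligand) pair representation, the placeholder identifiers, and the reshuffle-until-derangement step are our illustrative assumptions, not a prescribed implementation.

```python
import random

def cross_pair_decoys(complexes, seed=0):
    """Re-pair each protein with a ligand taken from a different
    experimental complex, yielding putative negative (decoy) pairs.

    `complexes` is a list of (protein_id, ligand_id) tuples drawn from
    published structures; the returned list is a derangement of the
    ligands, so no ligand is matched back to its cognate protein.
    """
    if len(complexes) < 2:
        raise ValueError("need at least two complexes to cross-pair")
    rng = random.Random(seed)
    proteins = [p for p, _ in complexes]
    cognate = [l for _, l in complexes]
    shuffled = cognate[:]
    # Reshuffle until no ligand lands back on its own protein.
    while any(s == c for s, c in zip(shuffled, cognate)):
        rng.shuffle(shuffled)
    return list(zip(proteins, shuffled))

# Hypothetical PDB-style identifiers, for illustration only.
positives = [("1ABC", "LIG1"), ("2DEF", "LIG2"),
             ("3GHI", "LIG3"), ("4JKL", "LIG4")]
negatives = cross_pair_decoys(positives)
```

In practice, these cross-paired decoys would then be scored alongside the positives by the pipeline under evaluation, so that enrichment over the negative background can be measured at each step.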