Efficiently quantifying dependence in massive scientific datasets using InterDependence Scores.

Large-scale scientific datasets today contain tens of thousands of random variables across millions of samples (for example, the RNA expression levels of 20,000 protein-coding genes across 30 million single cells). Quantifying dependencies between these variables would help us discover novel relationships between variables of interest. Simple measures of dependence, such as Pearson correlation, are fast to compute but limited because they are designed to detect only linear relationships between variables. More complex measures can detect any kind of dependence, but they do not readily scale to many modern datasets of interest. We introduce the InterDependence Score (IDS), a scalable measure of dependence that captures linear and various nonlinear dependencies between random variables. Our IDS algorithm is motivated by two ideas: a dependence measure defined in infinite-dimensional Hilbert spaces, which is capable of capturing any type of dependence, and a fast (linear-time) algorithm that neural networks natively implement to compute dependencies between random variables. We apply IDS to identify 1) relevant variables for predictive modeling tasks, 2) sets of words forming topics from millions of documents, and 3) sets of genes related to "gene-expression programs" in tens of millions of cells. We provide an efficient implementation that computes IDS between billions of pairs of variables across millions of samples in several hours on a single GPU. Given its speed and effectiveness in identifying nonlinear dependencies, we envision that IDS will be a valuable tool for uncovering insights from scientific data.
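
The limitation of Pearson correlation noted in the abstract can be illustrated with a small toy example. The sketch below is not the IDS algorithm and does not use the authors' implementation; it only shows, under assumed synthetic data, that Pearson correlation can be close to zero for a strongly dependent pair (y roughly equal to x squared) and becomes large once a nonlinear feature of x is used. The helper `pearson`, the sample size, the noise level, and the hand-picked feature `x**2` are all illustrative assumptions.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))


# Synthetic data (assumption): x is symmetric around zero and
# y depends on x only through x**2, plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x**2 + 0.05 * rng.normal(size=x.size)

# Linear correlation misses the relationship almost entirely.
r_linear = pearson(x, y)

# Correlating a nonlinear feature of x with y reveals the dependence.
# The feature is hand-chosen here purely for illustration.
r_feature = pearson(x**2, y)

print(f"Pearson(x, y)    = {r_linear:.3f}")
print(f"Pearson(x**2, y) = {r_feature:.3f}")
```

In this toy setting the right feature is known in advance. The abstract describes IDS as a way to capture such nonlinear dependencies at scale without specifying them by hand.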

Journal:	Proceedings of the National Academy of Sciences of the United States of America	Impact factor:	9.100
Year:	2025	Citation:	2025 Aug 26; 122(34):e2509860122
doi:	10.1073/pnas.2509860122

