Abstract
Loss-of-function variants (LoFs) can result in severe clinical phenotypes, including both autosomal-recessive and autosomal-dominant Mendelian diseases. Except for a handful of unusually common variants, however, the lifetime risk of disease expression associated with these variants is unknown. This is particularly true for LoFs in genes linked to autosomal-dominant diseases driven by haploinsufficiency, which represent some of the most common monogenic disorders. Here, we investigate the disease-expression rates for >6,000 predicted LoFs (pLoFs) linked to 91 haploinsufficient diseases using the electronic health records (EHRs) of ∼24,000 pLoF heterozygotes identified in two population-scale biobanks (the UK Biobank and the All of Us Research Program). Consistent with prior analyses, most pLoF heterozygotes displayed no evidence of disease expression, a phenomenon that persisted after accounting for variant annotation artifacts, missed diagnoses, and incomplete clinical data. Although it is infeasible to remove every artifact and bias from EHR data, we hypothesized that many of these pLoFs have intrinsically low or even no penetrance, which may be driven by residual allelic activity. To test this, we trained machine-learning models to predict disease-expression risk for pLoFs using only their genomic features. In validation experiments, the models were predictive of pLoF disease-expression rates across a range of diseases and variants, including those previously annotated as pathogenic by diagnostic-testing laboratories. This suggests that many pLoFs have intrinsically incomplete or even no penetrance (i.e., are benign) due to residual allelic activity, complicating prognostication in asymptomatic individuals.