Abstract
Next-generation sequencing (NGS) data are widely used, but their compositional nature poses challenges for interpretation. Using simulations, we evaluated four normalization methods (relative abundance, centered log-ratio [CLR], trimmed mean of M-values [TMM], and DESeq2) for their ability to identify true signals in compositional microbiota data. Two experiments were conducted: one in which specific taxa only increased, and one in which specific taxa increased and decreased in a 1:1 ratio. Simulated sequencing produced compositional data, which were then normalized with each of the four methods. We compared the absolute abundance data with the normalized compositional data in terms of variance explained and false discovery rates. Relative to the absolute abundance data, all normalization methods showed lower variance explained and more false positives and false negatives. CLR, TMM, and DESeq2 did not improve on relative abundance data and sometimes worsened false discovery rates. These results show that false positives and false negatives are common in compositional NGS datasets and that current normalization methods do not consistently address them. Compositionality artefacts should be considered when interpreting NGS results, and obtaining absolute abundances of features/taxa is recommended to distinguish biological signals from artefacts.
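
The compositionality artefact can be illustrated with a small toy example. The sketch below is not the simulation used in this study. The four-taxon community, the abundance values, the function names, and the CLR pseudocount are all assumptions chosen for illustration. It shows how a true increase in a single taxon can appear as a decrease in unchanged taxa once the data are expressed as relative abundances or CLR values.

```python
import numpy as np


def relative_abundance(counts):
    """Scale each sample so that its taxa sum to 1 (closure)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=-1, keepdims=True)


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log value minus the mean log value per sample.

    The pseudocount (an assumed value) avoids log(0) for zero counts.
    """
    log_x = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return log_x - log_x.mean(axis=-1, keepdims=True)


# Hypothetical absolute abundances for four taxa in two conditions.
# Only taxon 0 truly changes; taxa 1-3 are biologically constant.
control = np.array([1000.0, 1000.0, 1000.0, 1000.0])
treated = np.array([4000.0, 1000.0, 1000.0, 1000.0])

abs_change = treated - control
rel_change = relative_abundance(treated) - relative_abundance(control)
clr_change = clr(treated) - clr(control)

print("absolute change:", abs_change)
print("relative change:", np.round(rel_change, 3))
print("CLR change:     ", np.round(clr_change, 3))
```

In this toy case the absolute change for taxa 1 to 3 is zero, yet both the relative abundance and CLR differences for those taxa are negative. In a differential abundance test such shifts could be reported as false positives.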