Abstract
The advent of environmental DNA (eDNA) metabarcoding marks a transformative era in large-scale biodiversity monitoring. However, the analysis of eDNA datasets is limited by incomplete reference databases and by the increasing volume of data that must be processed from raw sequences to annotated taxonomic lists. To curate taxonomic lists from eDNA analyses, experts apply geographic constraints post-analysis, which may introduce biases in assignments. Instead of relying on expert intervention, a combination of taxonomic and geographic co-occurrences could be integrated directly into machine learning to automate and improve taxonomic annotation. Here, we introduce a deep learning approach to the taxonomic assignment of eDNA sequences, which leverages a species reference database, species co-occurrence data, and a phylogeny to enhance annotation directly from raw sequences. The phylogeny provides the structure of the network's embedding space, in which DNA sequences are placed using an artificial neural network (ANN). We train an additional ANN on the phylogenetic embedding and species co-occurrence data to learn coherent species combinations from the whole collection of eDNA sequences, as opposed to single sequences only. When applied directly to the raw sequences, this method correctly predicts unseen species (i.e., those not contained in the reference database), out of more than 31,000 possibilities, in about 24% of the tested cases by relying on phylogenetic embeddings and geographic modulation. The trained ANNs discern species relationships accurately from the raw data, which facilitates the process of associating sequences with taxa, even those absent from reference databases. When applied to real eDNA samples, our predictions mostly agree with those from a traditional bioinformatic pipeline, highlighting the potential of our method for annotating the increasing number of eDNA sequences.