Abstract
SNP microarrays provide a cost-effective genotyping method used across many scientific disciplines. Per-sample costs range from tens to hundreds of dollars, storage costs are comparatively reasonable, and analysis methods scale easily to large sample sizes. However, microarrays are designed for high-quality samples rather than low-quantity DNA inputs, so when working with challenged samples the resulting uncertainty must be properly accounted for. Rather than calling crisp genotypes when the data are uncertain, it is better to represent genotypes probabilistically. This approach feeds cleanly into tools that consider likelihoods directly, while remaining compatible with tools that expect hard calls simply by removing uncertain genotype calls. Several machine learning algorithms were used to estimate genotypes and genotype likelihoods from Illumina Omni5-4 microarray data, and the results were compared. While neural networks and XGBoost were both performant, XGBoost appears to generalize better across sample types run on the Omni5-4 chips (generalization between technologies awaits further examination). Further, XGBoost can more directly produce an estimate of genotype quality (as opposed to scores), a feature that has been lacking in microarray analysis.