Abstract
We propose a similarity measure based on the multivariate hypergeometric distribution for the pairwise comparison of images and data vectors. The formulation and performance of the proposed measure are compared with those of other similarity measures using synthetic data. A piecewise approximation method is also implemented to facilitate application of the proposed measure to large samples. Example applications of the proposed measure are presented using mass spectrometry imaging data and gene expression microarray data. Results from synthetic and biological data indicate that the proposed measure can discriminate meaningfully between samples and can be a useful tool for identifying potentially related samples in large-scale biological data sets.