PURPOSE: Unlabeled medical image data are abundant, yet converting them into a labeled ("truth-known") database is costly in time and resources and fraught with ethical and logistical issues. The authors propose a dual-stage CADx scheme in which both labeled ("truth-known") and unlabeled ("truth-unknown") data are used. This study is an initial exploration of the potential for leveraging unlabeled data toward enhancing breast CADx.
METHODS: From a labeled ultrasound image database consisting of 1126 lesions with an empirical cancer prevalence of 14%, 200 different randomly sampled subsets were selected, and the truth status of a variable number of cases was masked from the algorithm to mimic different types of labeled and unlabeled data sources. The prevalence was fixed at 50% cancerous for the labeled data and 5% cancerous for the unlabeled data. In the first stage of the dual-stage CADx scheme, which the authors term "transductive dimension reduction regularization" (TDR-R), both labeled and unlabeled images, characterized by extracted lesion features, were combined using dimension reduction (DR) techniques and mapped to a lower-dimensional representation. (The first stage ignored truth status and was therefore an unsupervised algorithm.) In the second stage, the labeled data from the reduced-dimension embedding were used to train a classifier to estimate the probability of malignancy. For the first CADx stage, the authors investigated three DR approaches: Laplacian eigenmaps, t-distributed stochastic neighbor embedding (t-SNE), and principal component analysis. For the TDR-R methods, the classifier in the second stage was a supervised (i.e., truth-utilizing) Bayesian neural net. The dual-stage CADx schemes were compared to a single-stage scheme based on manifold regularization (MR) in a semisupervised setting via the LapSVM algorithm.
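The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes synthetic two-dimensional "lesion features", uses principal component analysis (one of the three DR approaches named above) computed by power iteration for the first stage, and substitutes a simple nearest-class-mean score for the Bayesian neural net in the second stage. All function and variable names are hypothetical.

```python
import random

def first_principal_axis(rows, iters=200):
    """Return (feature means, unit vector of the first principal component)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the dominant eigenvector of cov.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def embed(row, mean, axis):
    """Map a feature vector to its one-dimensional embedding."""
    return sum((x - m) * a for x, m, a in zip(row, mean, axis))

rng = random.Random(0)

def sample(center, count):
    """Synthetic 2-D features around (center, center); an assumption for illustration."""
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(count)]

# Labeled set at 50% prevalence, unlabeled set at 5% prevalence (as in the abstract).
labeled_x = sample(3.0, 10) + sample(0.0, 10)
labeled_y = [1] * 10 + [0] * 10          # 1 = malignant, 0 = benign
unlabeled_x = sample(3.0, 5) + sample(0.0, 95)

# Stage 1 (unsupervised): fit the embedding on labeled AND unlabeled cases,
# without looking at any truth labels.
mean, axis = first_principal_axis(labeled_x + unlabeled_x)

# Stage 2 (supervised): use only the labeled cases, in the embedded space.
z = [embed(x, mean, axis) for x in labeled_x]
z_mal = sum(zi for zi, y in zip(z, labeled_y) if y == 1) / labeled_y.count(1)
z_ben = sum(zi for zi, y in zip(z, labeled_y) if y == 0) / labeled_y.count(0)
midpoint = (z_mal + z_ben) / 2
direction = 1.0 if z_mal > z_ben else -1.0

def malignancy_score(row):
    """Higher values indicate 'more likely malignant'; not a calibrated probability."""
    return direction * (embed(row, mean, axis) - midpoint)
```

The point of the sketch is the division of labor: the unlabeled cases influence only the embedding, while the truth labels enter only in the second stage.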
Performance of the CADx schemes, measured as the area under the ROC curve (AUC), was evaluated in leave-one-out and .632+ bootstrap analyses on a by-lesion basis. Additionally, the trained algorithms were applied to an independent test data set consisting of 101 lesions with approximately 50% cancer prevalence. The difference in AUC (deltaAUC) between training with and without unlabeled data was computed.
RESULTS: Statistically significant differences in the average AUC value (deltaAUC) were found in many instances between training with and without unlabeled data, based on the sample set distributions generated from this particular ultrasound data set during cross-validation and using the independent test set. For example, when using 100 labeled and 900 unlabeled cases and testing on the independent test set, the TDR-R methods produced an average deltaAUC = 0.0361 with 95% interval [0.0301, 0.0408] (p-value < 0.0001, adjusted for multiple comparisons, but considering the test set fixed) using t-SNE and an average deltaAUC = 0.026 [0.0227, 0.0298] (adjusted p-value < 0.0001) using Laplacian eigenmaps, while the MR-based LapSVM produced an average deltaAUC = 0.0381 [0.0351, 0.0405] (adjusted p-value < 0.0001). The authors also found that schemes that initially performed below average when using labeled data only showed the most prominent increase in performance when unlabeled data were added in the first CADx stage, suggesting a regularization effect due to the injection of unlabeled data.
CONCLUSION: The findings provide evidence that incorporating information from unlabeled data into the development of CADx methods may improve classifier performance by non-negligible amounts, and that this approach warrants further investigation.
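As a small worked illustration of the deltaAUC quantity reported above, the following sketch computes the AUC with the Mann-Whitney pair-counting formulation and takes the difference between a classifier trained with unlabeled data and one trained without. The scores and labels are made-up toy values, and the function names are hypothetical; nothing here reproduces the study's data or its interval estimates.

```python
def auc(scores, labels):
    """Area under the ROC curve as the fraction of (malignant, benign) pairs
    in which the malignant case receives the higher score; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def delta_auc(scores_with, scores_without, labels):
    """AUC when trained with unlabeled data minus AUC when trained without."""
    return auc(scores_with, labels) - auc(scores_without, labels)

# Toy test set: 1 = malignant, 0 = benign (illustrative values only).
labels = [1, 1, 1, 0, 0, 0]
scores_without = [0.9, 0.4, 0.6, 0.5, 0.3, 0.2]   # one malignant case ranked below a benign one
scores_with = [0.9, 0.55, 0.6, 0.5, 0.3, 0.2]     # that case is now ranked correctly

d = delta_auc(scores_with, scores_without, labels)  # 9/9 - 8/9 = 1/9
```

In the study, a deltaAUC of this kind would be obtained for each of the 200 sampled subsets and then averaged; the toy example shows only a single such comparison.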
Enhancement of breast CADx with unlabeled data.
Authors: Jamieson Andrew R, Giger Maryellen L, Drukker Karen, Pesce Lorenzo L
| Journal: | Medical Physics | Impact factor: | 3.200 |
| Year: | 2010 | Citation: | 2010 Aug;37(8):4155-72 |
| DOI: | 10.1118/1.3455704 | | |
