Abstract
Many radar applications rely primarily on visual classification for their evaluations. However, new research is integrating textual descriptions alongside visual input and showing that such multimodal fusion improves contextual understanding. A critical issue in this area is the effective alignment of coded text with corresponding images. To this end, our paper presents an adversarial training framework that generates descriptive text from the latent space of a visual radar classifier. Our quantitative evaluations show that this dual-task approach maintains a robust classification accuracy of 98.3% despite the inclusion of Gaussian-distributed latent spaces. Beyond these numerical validations, we conduct a qualitative study of the text output in relation to the classifier's predictions. This analysis highlights the correlation between the generated descriptions and the assigned categories and provides insight into the classifier's visual interpretation processes, particularly in the context of normally uninterpretable radar data.