Abstract
Generative Large Language Models (LLMs) are transforming mental health care by enabling the generation and understanding of human-like text with increasing nuance and contextual awareness. However, mental health is a complex, multidimensional domain that often requires richer sources of information than text alone. This narrative review explores the emerging role of Multimodal LLMs (MLLMs), models that integrate diverse input modalities such as speech, images, video, and physiological signals to capture the multifaceted nature of mental states and human interactions. We first outline the foundational principles of MLLMs and how they differ from traditional text-only LLMs. We then synthesize recent empirical studies and experimental applications of MLLMs in mental health research and clinical settings, highlighting their potential to improve diagnostic accuracy, enable real-time monitoring, and support context-aware, personalized interventions. Finally, we identify opportunities for future research and innovation, and discuss key implementation challenges in MLLM-based mental health care.