Abstract
Visual quality assessment is entering a new era as visual media evolve from static images to temporally dynamic videos and 3D content. These signals are typically captured by sensing devices such as cameras and depth sensors, whose acquisition characteristics significantly influence perceptual quality. Traditional quality models, including distortion-centric and regression-based approaches, perform well on conventional degradations but struggle to evaluate higher-level attributes, such as semantic plausibility and structural coherence, in modern AI-generated and multimodal scenarios. The emergence of large multimodal models (LMMs), including vision–language models (VLMs) and multimodal large language models (MLLMs), is reshaping the evaluation paradigm by enabling semantic grounding, instruction-driven assessment, and explainable reasoning. This survey presents a unified perspective on visual quality assessment for sensor-captured visual data across image, video, and 3D modalities. We review both conventional deep learning approaches and recent LMM-based methods, highlighting how multimodal fusion and language-conditioned reasoning transform quality assessment from scalar prediction to perceptual intelligence. Finally, we discuss key challenges and future opportunities for building efficient, robust, and sensor-aware visual quality assessment systems.