Abstract
Controllable Image Captioning (CIC) aims to generate coherent and semantically faithful textual descriptions of images while adhering to user-specified constraints. Existing methods achieve promising results under individual constraints such as sentiment or sentence length. However, they typically fail to satisfy multiple constraints simultaneously, because the controls often interact and interfere with one another. To address this challenge, we propose Internal-External Multi-Agent Steering (IE-MAS) for CIC. IE-MAS introduces an internal multimodal steering (IMS) strategy to control affective coherence within the caption, and an external multi-agent collaboration system (EMCS) to guide visual grounding and contextual alignment. From an information-theoretic view, IMS reduces uncertainty in the generation process, while EMCS strengthens the dependency between captions and visual inputs, converting the length and sentiment constraints into information gains. Together, the two components form an adaptive steering process that weighs internal linguistic control against external perceptual grounding, yielding a stable trade-off among semantic consistency, affective expression, and length control. Experimental results demonstrate that IE-MAS effectively coordinates multiple constraints, producing captions that satisfy the length constraint while remaining sentimentally expressive and visually faithful.