Abstract
This paper presents a new image captioning system that incorporates facial expression recognition to improve the emotional and contextual quality of the generated captions. The model combines affective cues with visual features, enabling descriptions that are both semantically rich and emotionally aware. Experiments were carried out on two purpose-built datasets, FlickrFace11k and COCOFace15k, and evaluated with the standard captioning metrics BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. The proposed model outperformed baselines such as Show-Attend-Tell and Up-Down on every metric. Notably, it achieved gains of 2.5 points on CIDEr and 1.0 on SPICE, indicating closer agreement with human-written reference captions. A 5-fold cross-validation confirmed the model's robustness, with a standard deviation across folds within ±0.2. Qualitative results further demonstrated its ability to capture fine-grained emotional expressions that conventional models often miss. These findings underscore the model's potential in affective computing, assistive technologies, and human-centric AI applications. The pipeline is designed for on-premises and edge deployment, with lightweight interfaces to IoT middleware (MQTT and OPC UA) that enable smart-factory integration. These characteristics align the method with Industry 4.0 sensor networks and human-centric analytics.