In multimodal recommendation, various data types are used, including text, images, and user dialogues, but the task faces two primary challenges. First, identifying user requirements is difficult because user intentions are inherently complex and diverse. Second, high-quality datasets are scarce and current recommender interactions feel unnatural; interactive datasets, and datasets that can evaluate temporal interactions between large models and humans, are especially lacking. Users therefore contend with fragmented information and unclear needs, while data scarcity limits the accuracy and comprehensiveness of both model evaluation and recommendation. Addressing these pain points is a significant opportunity, and combining multimodal modeling with large language models (LLMs) is a promising way to do so, since such systems can accept a broader range of inputs and sustain seamless, coherent dialogue.

This article employs multimodal techniques together with cross-attention mechanisms, a self-reflection mechanism, multi-graph neural networks (MGCN), and residual networks. The multimodal components handle heterogeneous data input; cross-attention fuses image and text representations; and the multi-graph and residual networks form the recommendation framework that improves accuracy. These components are combined with an adapted LLM that applies a reflection methodology and exploits the LLM's ease of communicating with humans, yielding a multimodal system capable of autonomous decision-making, intelligent recommendation, and self-reflection. The system also includes a recommendation module that consults different domain experts according to the user's requirements.
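As an illustration of the cross-attention fusion step described above, the following is a minimal sketch rather than the paper's implementation: the module name, feature dimensions, and the choice of text tokens as queries over image-patch embeddings are assumptions.

```python
# Minimal cross-attention fusion sketch (PyTorch). Hypothetical setup:
# text token features act as queries over image-patch features.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Standard multi-head attention reused as cross-attention:
        # queries come from one modality, keys/values from the other.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, num_tokens,  dim)
        # image_feats: (batch, num_patches, dim)
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection, echoing the abstract's mention of residual networks.
        return self.norm(text_feats + fused)


# Usage with random tensors standing in for encoder outputs.
fusion = CrossModalFusion()
text = torch.randn(2, 16, 512)    # e.g. text-encoder token embeddings
image = torch.randn(2, 49, 512)   # e.g. 7x7 grid of image-patch embeddings
print(fusion(text, image).shape)  # torch.Size([2, 16, 512])
```

The residual connection keeps the fused output anchored to the original text features, which is one common way to combine such a block with the residual networks the abstract mentions.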
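The self-reflection mechanism is likewise only described at a high level; the loop below is a schematic of the general draft-critique-revise idea, where `call_llm` is a hypothetical placeholder for however the adapted LLM is invoked, not an API from the paper.

```python
# Schematic self-reflection loop: draft a recommendation, critique it against
# the user request, and revise. `call_llm` is a hypothetical stand-in.
from typing import Callable


def reflective_recommend(user_request: str,
                         call_llm: Callable[[str], str],
                         max_rounds: int = 2) -> str:
    # Initial draft recommendation from the user's stated requirements.
    draft = call_llm(f"Recommend items for this request:\n{user_request}")
    for _ in range(max_rounds):
        # Ask the model to critique its own draft against the request.
        critique = call_llm(
            "Critique the recommendation below. Reply DONE if it already "
            f"satisfies the request.\nRequest: {user_request}\nRecommendation: {draft}"
        )
        if "DONE" in critique:
            break
        # Revise the draft using the critique before the next round.
        draft = call_llm(
            "Revise the recommendation using this critique.\n"
            f"Request: {user_request}\nRecommendation: {draft}\nCritique: {critique}"
        )
    return draft
```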
Through experimentation, the system makes significant strides in understanding user intent from input keywords and outperforms classic multimodal recommendation models such as CLIP and BLIP-2. This indicates that the system can intelligently generate suggestions that meet user requirements and enhance the user experience, offering a novel perspective for the development of multimodal recommendation systems with substantial practical application potential.

We conducted extensive evaluations of the proposed model, including an ablation study, a comparison with state-of-the-art methods, and performance analysis on multiple datasets. In the ablation study, the full model achieves the highest scores on all metrics, with an accuracy of 0.9526, precision of 0.94, recall of 0.95, and an F1 score of 0.94. Removing key components degrades performance: excluding the LLM component has the largest impact, reducing the F1 score to 0.91, and removing MGCN or cross-attention also lowers accuracy, confirming their critical role. Against state-of-the-art methods, the model outperforms LightGCN and DualGNN on all key metrics; LightGCN reaches an accuracy of 0.9210 and DualGNN 0.9285, both below the proposed model. Results on multiple datasets further highlight the effectiveness of MGCN and cross-attention. On the QK-Video and QB-Video datasets, MGCN achieves the highest recall, with Recall@5 of 0.6556 and 0.6856 and Recall@50 of 0.9559 and 0.9059, respectively. Cross-attention shows strong early recall, reaching Recall@10 of 0.8522 on the Tourism dataset, whereas CLIP and BLIP-2 are more moderate on Tourism, at 0.3423 and 0.4531 for Recall@5, respectively. Overall, the model consistently surpasses existing approaches, with MGCN and cross-attention delivering superior retrieval and classification performance across tasks, including visual question answering (VQA). In addition, this work constructs a comprehensive dataset for this field, with 9004 data entries per column.
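For reference, the Recall@K figures reported above are conventionally computed as the fraction of a user's held-out relevant items that appear in the top-K ranked list; the sketch below follows that standard definition and is not the paper's evaluation code.

```python
# Recall@K under the standard definition: |top-K hits| / |relevant items|,
# averaged over users. Illustrative only.
from typing import Dict, List, Set


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int) -> float:
    scores = []
    for user, ranked_items in rankings.items():
        rel = relevant.get(user, set())
        if not rel:
            continue  # skip users with no held-out relevant items
        hits = len(set(ranked_items[:k]) & rel)
        scores.append(hits / len(rel))
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: one of two relevant items is retrieved in the top 2.
print(recall_at_k({"u1": ["a", "b", "c"]}, {"u1": {"b", "d"}}, k=2))  # 0.5
```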
MMAgentRec, a personalized multi-modal recommendation agent with large language model.
Author: Xiao, Xiaochen
| Journal: | Scientific Reports |
| Impact factor: | 3.900 |
| Year: | 2025 |
| Citation: | 2025 Apr 8; 15(1):12062 |
| DOI: | 10.1038/s41598-025-96458-w |
