Visual enumeration remains challenging for multimodal generative AI

Abstract

Many animal species can approximately judge the number of objects in a visual scene at a single glance, and humans can further determine the exact cardinality of a set by deploying systematic counting procedures. In contrast, even state-of-the-art AI systems have been observed to have very limited enumeration skills. In this work, we propose two benchmark tasks inspired by cognitive science that allow precise evaluation of the visual enumeration capabilities of multimodal foundation models, thereby providing an objective measure of their number sense and counting level. We consider popular visual question answering models (BLIP, LLaVA and ViLT) as well as advanced image-to-text (Gemini, GPT and Qwen) and text-to-image (DALL-E, FLUX and Stable Diffusion) AI systems. Our analyses show that even the most advanced models cannot reliably name the number of objects in simple visual stimuli or generate images containing a target number of items, as indexed by their low accuracy in both types of tasks. Especially for numbers outside the subitizing range, their responses are often far from the target numerosity and, in stark contrast with human behavior, in many cases the distribution of errors depends on the object category. We also observe some striking mistakes with small numbers. Our findings demonstrate that developing an intuitive visual understanding of number remains challenging for AI models and that merely increasing model size might not be a viable strategy to promote the emergence of systematic counting skills. We release the full code of our benchmark to facilitate the evaluation of enumeration skills in future AI systems.
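The abstract summarizes performance by accuracy and by how far responses fall from the target numerosity. As a minimal sketch of how such scores could be computed, and not the authors' released benchmark code, the hypothetical function `score_enumeration` below groups (target, response) pairs by target number and reports exact-match accuracy and mean absolute error for each. The function name, the data layout, and the example trials are all assumptions made for illustration.

```python
from collections import defaultdict


def score_enumeration(trials):
    """Summarize enumeration responses per target numerosity.

    trials: iterable of (target, response) integer pairs (assumed layout).
    Returns {target: (accuracy, mean_absolute_error)}.
    """
    by_target = defaultdict(list)
    for target, response in trials:
        by_target[target].append(response)

    summary = {}
    for target, responses in sorted(by_target.items()):
        n = len(responses)
        # Fraction of responses that match the target exactly.
        accuracy = sum(r == target for r in responses) / n
        # Average distance between response and target.
        mae = sum(abs(r - target) for r in responses) / n
        summary[target] = (accuracy, mae)
    return summary


# Made-up illustrative data, not results from the paper.
trials = [(2, 2), (2, 2), (7, 6), (7, 9), (7, 7)]
summary = score_enumeration(trials)
for target, (accuracy, mae) in summary.items():
    print(f"target={target} accuracy={accuracy:.2f} mae={mae:.2f}")
```

Reporting the two quantities separately for each target makes it possible to distinguish near misses from responses that are far off, and to compare small numbers with larger ones.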
