Abstract
Visual perception unfolds through a hierarchy of transformations, beginning with the extraction of low-level features, such as edges, and culminating in the representation of high-level features, such as object categories. While the processing of low- and high-level features is well studied, the intermediate transformations, that is, mid-level features, remain poorly understood. Here, we introduce a stimulus set of naturalistic 3D-rendered images and videos with ground-truth annotations for five candidate mid-level features (reflectance, scene depth, world normals, lighting, and skeleton position), alongside one low-level feature (edges) and one high-level feature (action identity). To determine when these features are processed in the brain, we collected electroencephalography (EEG) responses during stimulus presentation and trained linearized encoding models to predict the EEG responses from the annotations. We first showed that candidate mid-level features were best represented between ~100 and 250 ms post-stimulus, temporally between low- and high-level features, consistent with a bridging role linking sensory and semantic processing. We then assessed convolutional neural networks (CNNs) as models of mid-level feature processing in humans and observed that, although their hierarchies were shallower, they exhibited a comparable processing order for mid-level, but not low- or high-level, features, and only for video stimuli. Together, our results support the view that mid-level features are tied to surface- and shape-related processing, and they establish annotated 3D-rendered stimuli as a valuable tool for investigating mid-level vision in biological and artificial neural networks.
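To make the linearized encoding approach concrete, the minimal sketch below fits a ridge regression from stimulus feature annotations to multichannel EEG, one model per time point, and scores it by the correlation between predicted and observed responses; the data, array shapes, and hyperparameters here are synthetic and purely illustrative, not the study's actual pipeline.

```python
# Minimal sketch of a linearized encoding model: ridge regression maps
# stimulus feature annotations to EEG responses, fit separately at each
# time point. All shapes and values are synthetic and illustrative.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stimuli, n_features = 200, 12   # hypothetical: 12 annotation dimensions per stimulus
n_channels, n_times = 64, 100     # hypothetical: EEG channels x time points

X = rng.standard_normal((n_stimuli, n_features))             # feature annotations
W = rng.standard_normal((n_features, n_channels, n_times))   # latent feature-to-EEG mapping
Y = np.einsum("sf,fct->sct", X, W) + rng.standard_normal((n_stimuli, n_channels, n_times))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)

scores = np.zeros(n_times)
for t in range(n_times):
    # Fit one multi-output ridge model per time point (channels are the outputs).
    model = Ridge(alpha=1.0).fit(X_tr, Y_tr[:, :, t])
    Y_hat = model.predict(X_te)
    # Encoding accuracy: predicted-observed correlation, averaged over
    # channels, yielding a time course of feature information in the EEG.
    r = [np.corrcoef(Y_hat[:, c], Y_te[:, c, t])[0, 1] for c in range(n_channels)]
    scores[t] = np.mean(r)

peak = scores.argmax()
print(f"Peak encoding accuracy r = {scores[peak]:.2f} at time index {peak}")
```

Comparing the time at which such encoding accuracy peaks across feature types is one common way to order low-, mid-, and high-level features along the processing hierarchy, as summarized in the abstract.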