Abstract
The automatic detection of student behaviors is essential for advancing smart-classroom technologies and providing data-driven insight into student engagement. However, existing methods face considerable obstacles: class imbalance, limited annotations, and subtle visual similarities between behavior categories. To address these limitations, we present a meta-learning framework that combines Vision Transformers with Prototypical Networks, enhanced by supervised contrastive learning and hard negative mining. The pipeline first preprocesses and crops the input images, using YAML annotations to focus on behavior-specific regions. Each input is converted into patch embeddings and processed by Transformer encoders to produce discriminative feature representations. Class prototypes are then computed from the support set, and query samples are classified by distance-based metrics within an episodic few-shot learning framework. Extensive experiments on the SCB-05 dataset under 5-way few-shot settings confirm the effectiveness of the proposed framework. The results show that combining Vision Transformers with contrastive learning substantially improves feature discriminability, while hard negative mining further boosts generalization. Under the 5-way 10-shot evaluation protocol, our method attains an overall accuracy of 91.3% and surpasses both the baseline ProtoNet and a Transformer variant without hard negative mining in mean Average Precision. Further analyses, including per-class evaluations, confusion matrices, and embedding visualizations, confirm the robustness and interpretability of the proposed model. These results establish a new benchmark for student behavior recognition and highlight the potential of meta-learning frameworks for practical educational applications.
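The prototype computation and distance-based query classification described above can be sketched as follows. This is a minimal NumPy illustration of the generic Prototypical Network step, not the authors' implementation: in the paper the embeddings come from a Vision Transformer, whereas here they are synthetic vectors, and the episode sizes are hypothetical.

```python
import numpy as np

def prototype_classify(support_emb, support_labels, query_emb, n_way):
    """Nearest-prototype classification for one few-shot episode."""
    # Class prototype = mean of the support embeddings for each class.
    prototypes = np.stack([
        support_emb[support_labels == c].mean(axis=0) for c in range(n_way)
    ])
    # Squared Euclidean distance from every query to every prototype.
    dists = ((query_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    # Each query is assigned the label of its nearest prototype.
    return dists.argmin(axis=1)

# Toy 3-way 2-shot episode with 4-dimensional stand-in "embeddings".
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 4)) * 5          # well-separated class centers
support_labels = np.repeat(np.arange(3), 2)    # two support samples per class
support_emb = centers[support_labels] + rng.normal(scale=0.1, size=(6, 4))
query_labels = np.array([0, 1, 2, 1])
query_emb = centers[query_labels] + rng.normal(scale=0.1, size=(4, 4))
pred = prototype_classify(support_emb, support_labels, query_emb, 3)
```

In training, one such episode is sampled per iteration and the distances are converted to class probabilities via a softmax, so the embedding network is optimized end to end to make same-class samples cluster around their prototype.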