Abstract
Face-to-face interaction is one of the most natural forms of human communication. Unsurprisingly, Video Conferencing (VC) Applications have experienced a significant rise in demand over the past decade. With the widespread availability of cellular devices equipped with high-resolution cameras, Instant Messaging Video Call Applications (IMVCAs) now constitute a substantial portion of VC communications. Given the multitude of IMVCA options, maintaining a high Quality of Experience (QoE) is critical. While content providers can measure QoE directly through end-to-end connections, Internet Service Providers (ISPs) must infer QoE indirectly from network traffic-a non-trivial task, especially when most traffic is encrypted. In this paper, we analyze a large dataset collected from WhatsApp IMVCA, comprising over 25,000 s of VC sessions. We apply four Machine Learning (ML) algorithms and a Large Multimodal Model (LMM)-based agent, achieving mean errors of 4.61%, 5.36%, and 13.24% for three popular QoE metrics: BRISQUE, PIQUE, and FPS, respectively.