Abstract
There is an ongoing effort in the machine learning community to enable machines to understand the world symbolically, facilitating human interaction with learned representations of complex scenes. A prerequisite for this is the ability to identify the dynamics of interacting objects from time traces of relevant features. In this paper, we introduce GrODID (GRaph-based Object-Centric Dynamic Mode Decomposition), a framework based on graph neural networks that enables Dynamic Mode Decomposition for systems of interacting objects. The main idea is to model the individual, potentially nonlinear dynamics of each object with a Koopman operator, whose Dynamic Mode Decomposition is identified using deep autoencoders, while the interactions among these systems are captured by a graph modeled with a graph neural network (GNN). The potential of this approach is illustrated with several applications arising in video analytics: forward and backward video prediction, video manipulation, and temporal super-resolution.