Abstract
Heavy-lift unmanned aerial vehicles (UAVs) are increasingly deployed in logistics, infrastructure installation, and emergency response missions, where complex payload dynamics and unstructured environments pose significant challenges to safe and efficient operation. Conventional manual teleoperation interfaces, such as dual-joystick control, impose a high cognitive workload and offer limited support for expressing high-level operator intent, while fully autonomous solutions remain difficult to deploy reliably under real-world uncertainty. To address these limitations, this paper proposes the Multimodal Fusion Cooperation Network (MFCN), an end-to-end shared autonomy framework that integrates speech commands, visual gestures, and haptic cues through cross-modal feature fusion to infer operator intent in real time. The fused intent representation is translated into dynamically feasible control commands by a cooperative control policy with embedded physics-aware constraints that suppress payload oscillations and preserve flight stability. Extensive semi-physical simulations and real-world experiments demonstrate that MFCN significantly improves task success rate, positioning accuracy, and payload stability while reducing task completion time and operator cognitive workload, compared with manual, unimodal, and heuristic multimodal baselines.
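
To make the cross-modal fusion step concrete, the following is a minimal sketch of one plausible way to combine per-modality embeddings into a shared intent representation via attention. It is not the paper's actual MFCN architecture: the module names, embedding dimensions, and the use of self-attention over stacked modality tokens are all illustrative assumptions.

```python
# Hypothetical sketch of cross-modal intent fusion (NOT the paper's MFCN).
# Assumes each modality is pre-encoded into a fixed-size feature vector;
# all dimensions below are illustrative.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4, num_intents: int = 8):
        super().__init__()
        # One learned projection per modality into a shared embedding space.
        self.proj_speech = nn.Linear(256, dim)   # assumed speech-encoder output size
        self.proj_gesture = nn.Linear(512, dim)  # assumed visual gesture feature size
        self.proj_haptic = nn.Linear(64, dim)    # assumed haptic/force cue size
        # Multi-head attention lets each modality token attend to the others.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Fused representation -> discrete operator-intent logits.
        self.intent_head = nn.Linear(dim, num_intents)

    def forward(self, speech, gesture, haptic):
        # Stack the three projected modality embeddings as a 3-token sequence.
        tokens = torch.stack(
            [self.proj_speech(speech),
             self.proj_gesture(gesture),
             self.proj_haptic(haptic)], dim=1)           # (B, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)      # cross-modal attention
        fused = self.norm(fused + tokens).mean(dim=1)     # residual + mean pooling
        return self.intent_head(fused)                    # (B, num_intents)

# Usage with a dummy batch of size 2:
model = CrossModalFusion()
logits = model(torch.randn(2, 256), torch.randn(2, 512), torch.randn(2, 64))
print(logits.shape)  # torch.Size([2, 8])
```

In a full shared-autonomy pipeline, the resulting intent distribution would then be consumed by the downstream control policy, with the physics-aware constraints (e.g., payload-swing limits) enforced at the command-generation stage rather than inside the fusion network.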