Abstract
The growing demand for distributed machine learning, such as Federated Learning (FL), in dynamic, resource-constrained edge environments and 5G/6G networks, together with the proliferation of mobile and edge devices, presents significant challenges in fault tolerance, elasticity, and communication efficiency. This paper addresses these issues by proposing a novel modular and resilient FL framework; in this context, resilience refers to the system's ability to maintain operation and performance despite disruptions. The framework is built on decoupled modules that handle core FL functionalities, allowing flexible integration of different algorithms, communication protocols, and resilience strategies. Results demonstrate the framework's ability to integrate multiple communication protocols and FL paradigms, showing that protocol choice significantly impacts performance, particularly in high-volume communication scenarios: Zenoh and MQTT exhibited lower overhead than Kafka in the tested configurations, with Zenoh emerging as the most efficient option. Additionally, the framework maintained model training and achieved convergence even under simulated probabilistic worker failures, reaching a Matthews correlation coefficient (MCC) of 0.9453.