Abstract
Intrusion detection systems (IDS) leveraging federated learning (FL) are increasingly deployed in Internet of Things (IoT) environments to address distributed data and privacy constraints. However, their generalization remains unclear, because most evaluations rely on a single dataset, which risks overfitting to site-specific traffic, label taxonomies, and non-IID client mixtures. This study provides a comprehensive dataset-centric evaluation of FL-based IDS across three contemporary IoT/IIoT datasets that differ in devices, feature distributions, and attack families: Edge-IIoTset (2022), CIC-IoT2023, and TII-SSRC-23 (2023). We benchmark three FL aggregation algorithms (FedAvg, FedProx, FedNova) with two deep learning backbones (LSTM and Transformer) to assess detection accuracy, cross-environment generalizability, convergence behavior, and communication cost. Methodologically, we construct non-IID clients by device or application type, harmonize labels to a common family-level schema, align features to the intersection set, and evaluate three regimes: in-domain, cross-dataset, and a combined multi-dataset federation. Results show that federated models approach centralized performance in-domain, with macro-F1 up to 98% and accuracies in the 92-98% range. Transformers consistently exceed LSTMs by ≈1-2 percentage points in macro-F1 at comparable communication budgets. Cross-dataset tests expose substantial degradation, with macro-F1 losses of up to 30 percentage points when models face unseen environments, underscoring the need for diverse training coverage. Combined multi-dataset federation substantially restores transfer, yielding ≈90% macro-F1 across datasets in the harmonized family-level setting. Under heterogeneous clients, FedProx improves stability by reducing round-to-round variance, while FedNova reaches target accuracy in fewer rounds and lowers communication cost by ≈15-25% relative to FedAvg.
These findings suggest a practical recipe for deployment: prioritize attack and environment diversity through combined-dataset FL, select Transformer backbones where feasible, and use FedProx or FedNova to stabilize training and reduce communication in bandwidth-constrained IoT settings.