Abstract
Domain insertion creates architectures in which one domain interrupts the sequence of another. Analysis of 2.7 million classified domains reveals that insertions occur in 20% of multidomain proteins. A total of 331 families exhibit consistent architectural roles: 162 function exclusively as hosts, while 169 serve exclusively as inserted modules; zinc-binding dehydrogenases, for example, appear as insertions across 450 events. The remaining 1116 families with sufficient insertion activity are versatile, adopting different roles depending on partnership context. Inserted domains are consistently smaller than their hosts (median 115 vs. 199 residues), and role-consistent families exhibit 1.7-fold size differences. Insertions frequently involve domains from different structural superfamilies: 31,925 events (65.8% of total) occur between families from different H-groups, such as P-loop hydrolases with tRNA modification domains. While most insertions are simple single-level architectures, insertion can also produce complex organizations, including six-level nested structures in cyanobacterial RNA polymerase. This work provides a comprehensive dataset of 48,551 insertion events across 5701 families, with quantitative characterization of size relationships and partnership patterns that can inform structure prediction and protein design efforts.