Abstract
Ultra-high sampling rates in coherent optical front-ends increasingly exceed the processing capabilities of real-time baseband processors, creating a bottleneck in coherent free-space optical communication systems. We propose a unified state-space framework to systematically parallelize digital signal processing (DSP) algorithms. This approach transforms an algorithm's transfer function into a state-space representation from which a parallel architecture is derived through matrix operations, overcoming the complexity of traditional ad hoc methods. Crucially, our framework enables an analysis of parallelization-induced latency. We introduce the parallel equivalent delay (PED) metric and demonstrate that it introduces right-half-plane zeros into the loop's transfer function, thereby fundamentally constraining stability. This analysis leads to the derivation of "Throughput-Bandwidth Product" (TBP), a constant that provides a design guideline linking maximum stable loop bandwidth to the parallelization factor. The framework's efficacy is demonstrated by designing a parallel Costas carrier recovery loop. Simulations validate its performance, confirm the TBP limit, and show significant advantages over conventional feedforward estimators, especially in low-SNR conditions. Implementation results on a AMD XCVU13P FPGA demonstrate that the proposed 50-parallel architecture achieves a throughput of 15.625 Gsps at a clock frequency of 312.5 MHz with a logic utilization below 7%. The experimental results confirm the theoretical trade-off between throughput and loop bandwidth, verifying the proposed design methodology.