Abstract
Large language models (LLMs) in IoT-edge-cloud settings face bursty, heterogeneous requests that make pipeline-parallel inference prone to micro-batch imbalance and communication stalls, causing GPU idle time and service-level objective (SLO) violations. We propose a runtime-adaptive scheduler that jointly tunes token budgets and micro-batch counts to balance prefill/decode workloads and minimize pipeline bubbles under changing compute and network conditions. On a four-node pipeline-parallel cluster running Llama-2-13b and Qwen2.5-14b at 100 Mbps and 1000 Mbps network bandwidths, our method outperforms vLLM and SGLang, reducing GPU idle time by up to 55% and improving throughput by up to 1.61× while raising time-to-first-token (TTFT) and inter-token latency (ITL) SLO attainment. These results show that dynamic scheduling is essential for scalable, latency-stable LLM inference in IoT-edge-cloud environments.