Abstract
Background: Accurate forecasting of lung tumor motion is crucial for precise radiotherapy. Deep-learning-based markerless tracking methods have been explored, but extending these approaches to predict future tumor trajectories remains largely unaddressed. We address this gap by framing markerless lung tumor motion forecasting as a spatio-temporal prediction task, using a vision transformer to estimate three-dimensional tumor positions over short horizons. Methods: Digitally reconstructed radiographs (DRRs) generated from four-dimensional computed tomography scans of 12 lung cancer patients were used to train a multi-patient (MP) model. Patient-specific (PS) models trained solely on planning data were compared against it, and the MP model was further fine-tuned using a small number of patient-specific treatment images under realistic clinical constraints. Models processed sequences of 12 DRRs, with performance evaluated via root mean square error. Results: Low-resolution inputs with larger patch sizes outperformed higher-resolution configurations by reducing the influence of image noise. PS models required extensive data to match MP performance, whereas fine-tuning the MP model with limited patient-specific data achieved comparable or superior forecasting accuracy at a lower cost. Conclusions: These findings demonstrate that vision transformers can extend markerless tracking methods to accurate short-term forecasting and highlight fine-tuning as an efficient strategy for personalized prediction.