Abstract
Image-based virtual try-on aims to generate realistic images of a person wearing a target garment by combining a clothing image with a person image. Traditional methods typically proceed in separate stages, including garment warping, segmentation-map generation, and final image synthesis. Because these stages do not interact, misalignments and visual artifacts frequently arise, particularly under occlusion or complex poses, degrading the realism and quality of the generated output. In this work, we introduce an enhanced virtual try-on framework that addresses these challenges with three key innovations. First, depth maps are incorporated into the model to provide spatial awareness, ensuring precise garment alignment and mitigating occlusion-related artifacts. Second, a refined garment-masking module improves segmentation consistency by generating accurate garment representations and excluding internal garment regions. Third, multi-head attention mechanisms are integrated into the feature extraction process to better preserve garment textures, patterns, and structural details. Extensive experiments on a high-resolution virtual try-on dataset demonstrate the effectiveness of the proposed framework. By tackling alignment and occlusion challenges, the model substantially improves visual quality, outperforming baseline methods and delivering realistic, visually appealing virtual try-on results.
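To make the two architectural ideas concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation: it illustrates depth-conditioned feature extraction (the depth map is concatenated with the garment image as a fourth input channel) followed by multi-head self-attention over spatial locations. All module names, channel sizes, and the late-fusion design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GarmentFeatureExtractor(nn.Module):
    """Illustrative sketch (not the paper's architecture): a small CNN stem
    over RGB + depth, followed by multi-head self-attention so that every
    spatial location can attend to the whole garment, helping preserve
    global texture, pattern, and structural cues."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        # 4 input channels: RGB garment image + 1-channel depth map
        # (depth concatenation is an assumed fusion strategy).
        self.stem = nn.Sequential(
            nn.Conv2d(4, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, garment_rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # garment_rgb: (B, 3, H, W); depth: (B, 1, H, W)
        x = self.stem(torch.cat([garment_rgb, depth], dim=1))  # (B, C, H/4, W/4)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W/16, C) spatial tokens
        tokens = self.norm(tokens)
        # Self-attention with a residual connection over spatial tokens.
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = tokens + attended
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Usage with random tensors standing in for real inputs.
extractor = GarmentFeatureExtractor()
rgb = torch.randn(1, 3, 256, 192)    # garment image
depth = torch.randn(1, 1, 256, 192)  # estimated depth map
features = extractor(rgb, depth)
print(features.shape)  # torch.Size([1, 64, 64, 48])
```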