Abstract
Reconstructing hand and object shapes from a single view during interaction remains challenging due to severe mutual occlusion and the demand for physical plausibility. To address this, we propose a novel framework for hand-object interaction reconstruction based on holistic, multi-stage collaborative optimization. Unlike methods that process hands and objects independently or apply physical constraints only as late-stage post-processing, our model progressively enforces physical consistency and geometric accuracy throughout the reconstruction pipeline. The network takes an RGB-D image as input. An adaptive feature fusion module first combines color and depth information to improve robustness against sensing uncertainties. We then introduce structural priors for 2D pose estimation and leverage texture cues to refine the depth-based 3D pose initialization. Central to our approach is the iterative application of a dense mutual attention mechanism during sparse-to-dense mesh recovery, which dynamically captures interaction dependencies while refining geometry. Finally, a Signed Distance Function (SDF) representation designed explicitly for contact surfaces prevents interpenetration and ensures physically plausible results. Comprehensive experiments show that our method outperforms state-of-the-art techniques on the challenging ObMan and DexYCB benchmarks. On ObMan, our approach achieves hand and object Chamfer distances (CD_h and CD_o) of 0.077 cm² and 0.483 cm², respectively; on DexYCB, it attains CD_h and CD_o values of 0.251 cm² and 1.127 cm², respectively.
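To make the dense mutual attention idea concrete, the following is a minimal sketch, not the authors' implementation: per-vertex hand and object features cross-attend in both directions during one refinement round, with a residual update per stream. The module name, feature width, and vertex counts are illustrative assumptions.

```python
# Minimal sketch of bidirectional dense mutual attention between hand and
# object vertex features (hypothetical names/dimensions, not the paper's code).
import torch
import torch.nn as nn

class DenseMutualAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Two cross-attention directions: hand queries object, object queries hand.
        self.hand_from_obj = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.obj_from_hand = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_h = nn.LayerNorm(dim)
        self.norm_o = nn.LayerNorm(dim)

    def forward(self, hand_feats: torch.Tensor, obj_feats: torch.Tensor):
        # hand_feats: (B, V_hand, dim), obj_feats: (B, V_obj, dim)
        h_upd, _ = self.hand_from_obj(hand_feats, obj_feats, obj_feats)
        o_upd, _ = self.obj_from_hand(obj_feats, hand_feats, hand_feats)
        # Residual updates keep each mesh's own geometry while injecting
        # interaction cues from the other mesh.
        return self.norm_h(hand_feats + h_upd), self.norm_o(obj_feats + o_upd)

# Toy usage: 778 MANO-style hand vertices attending to 1000 object vertices.
hand = torch.randn(1, 778, 64)
obj = torch.randn(1, 1000, 64)
hand_refined, obj_refined = DenseMutualAttention()(hand, obj)
```

In this reading, applying such a block iteratively during sparse-to-dense mesh recovery is what lets the interaction dependencies be re-estimated as the geometry is refined, rather than fixed once up front.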