Leveraging two-dimensional pre-trained vision transformers for three-dimensional model generation via masked autoencoders


Abstract

Although the Transformer architecture has become the de facto standard for natural language processing tasks, its applications in computer vision remain limited. In vision, attention is typically used either in conjunction with convolutional networks or to replace individual components of a convolutional network while preserving the overall network design. Transferring Transformers from language to vision is difficult because of differences between the two domains, such as large variations in the scale of visual entities and the far higher granularity of pixels in images compared with words in text. Masked autoencoding is a promising self-supervised learning approach that has greatly advanced both computer vision and natural language processing. Pre-training on large-scale image data has become standard practice for learning robust 2D representations. In contrast, the scarcity of 3D datasets, a consequence of the high cost of acquiring and processing 3D data, significantly impedes the learning of high-quality 3D features. We present a strong multi-scale MAE pre-training architecture that leverages a pre-trained ViT and a 2D-to-3D representation model to enable self-supervised learning on 3D point clouds. Specifically, we use rich 2D knowledge to guide a 3D masked autoencoder, which reconstructs the masked point tokens through an encoder-decoder architecture during self-supervised pre-training. We first employ pre-trained 2D models to extract multi-view visual features of the input point cloud. We then introduce a 2D-guided masking strategy that keeps semantically significant point tokens visible. Extensive experiments demonstrate the effectiveness of our method with pre-trained models and its strong generalization to a range of downstream tasks. In particular, our pre-trained model achieves 93.63% linear-SVM accuracy on ScanObjectNN and 91.31% on ModelNet40.
Our approach demonstrates that a straightforward architecture based solely on standard Transformers can outperform specialized Transformer models trained with supervised learning.
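The 2D-guided masking strategy described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the per-token saliency scores (assumed here to come from projected 2D ViT features), the function name, and the masking/protection ratios are all hypothetical assumptions.

```python
import numpy as np


def saliency_guided_mask(saliency, mask_ratio=0.6, protect_frac=0.1, rng=None):
    """Choose which point tokens to mask for MAE-style pre-training.

    The most salient tokens (hypothetically scored by projecting 2D
    ViT features onto the point patches) are always kept visible;
    masking is applied at random among the remaining tokens until the
    target mask ratio is reached.

    Returns a boolean array where True marks a masked token that the
    decoder must reconstruct.
    """
    rng = np.random.default_rng(rng)
    n = len(saliency)
    order = np.argsort(-saliency)           # token indices, most salient first
    n_protect = int(n * protect_frac)
    candidates = order[n_protect:]          # only these may be masked
    n_mask = int(n * mask_ratio)            # requires mask_ratio <= 1 - protect_frac
    masked = rng.choice(candidates, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[masked] = True
    return mask


# Example: 100 point tokens with illustrative saliency scores.
scores = np.linspace(0.0, 1.0, 100)
mask = saliency_guided_mask(scores, mask_ratio=0.6, protect_frac=0.1, rng=42)
# 60 tokens are masked; the 10 most salient tokens remain visible.
```

Under this sketch, the encoder would process only the visible tokens (`~mask`), while the decoder reconstructs the masked ones, following the standard MAE encoder-decoder split.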
