Abstract
This paper presents a transformer-based method for colorizing grayscale images. By employing a deep architecture with stacked encoder-decoder layers, the model effectively captures intricate features, significantly improving its expressive capacity. The encoder extracts features from the input grayscale image, while the decoder uses these features, together with the self-attention mechanism and contextual information, to predict the corresponding a* and b* chrominance components in the CIELAB color space; these are then combined with the L channel of the input image to produce the colorized result. To ensure the generated image matches a real photograph in both pixel accuracy and visual quality, the method combines several loss functions. Experimental results demonstrate significant performance improvements in black-and-white image colorization. The generated images exhibit natural coloring and rich detail, making them valuable in practical applications.