A comprehensive survey on automatic image captioning: deep learning techniques, datasets and evaluation parameters
Abstract
Automatic image captioning lies at a pivotal intersection of computer vision and natural language processing, aiming to generate descriptive textual content from visual inputs. This comprehensive survey explores the evolution and state-of-the-art advancements in image caption generation, focusing on deep learning techniques, benchmark datasets, and evaluation parameters. We begin by tracing the progression from early approaches to contemporary deep learning methodologies, emphasizing encoder-decoder and transformer-based models. We then systematically review the datasets that have been instrumental in training and benchmarking image captioning models, including MSCOCO, Flickr30k, Flickr8k, and PASCAL 1k, discussing image counts, scene types, and sources. Furthermore, we delve into the evaluation metrics employed to assess model performance, such as bilingual evaluation understudy (BLEU), metric for evaluation of translation with explicit ordering (METEOR), recall-oriented understudy for gisting evaluation (ROUGE), and consensus-based image description evaluation (CIDEr), analyzing their domains, bases, and measurement criteria. Through this survey, we aim to provide a detailed understanding of the current landscape, identify challenges, and propose future research directions in automatic image captioning.
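To make the encoder-decoder paradigm surveyed above concrete, the following is a minimal illustrative sketch, not a model from the paper: a CNN encoder maps the image to a fixed-length feature vector, and an RNN decoder generates the caption conditioned on that vector. All class names, sizes, and the use of PyTorch/torchvision are assumptions made for illustration.

```python
# Minimal encoder-decoder captioning sketch (illustrative only).
# Assumes PyTorch and torchvision are installed; vocab_size etc. are placeholders.
import torch
import torch.nn as nn
from torchvision import models

class EncoderCNN(nn.Module):
    """CNN encoder: maps an image to a fixed-length feature vector."""
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet18(weights=None)            # pretrained weights optional
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)          # (B, 512)
        return self.fc(feats)                             # (B, embed_size)

class DecoderRNN(nn.Module):
    """LSTM decoder: generates caption logits conditioned on the image feature."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first step of the input sequence.
        emb = self.embed(captions)                          # (B, T, E)
        inputs = torch.cat([features.unsqueeze(1), emb], 1) # (B, T+1, E)
        out, _ = self.lstm(inputs)
        return self.fc(out)                                 # (B, T+1, V)

# Toy forward pass with random data.
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 15))
logits = DecoderRNN()(EncoderCNN()(images), captions)
print(logits.shape)  # torch.Size([2, 16, 10000])
```

Likewise, a generated caption can be scored against reference captions with the metrics named in the abstract. The sketch below uses NLTK's sentence-level BLEU purely for illustration; the benchmarks discussed in the survey typically report corpus-level BLEU, METEOR, ROUGE, and CIDEr via the official COCO evaluation toolkit, and the example sentences here are invented.

```python
# Illustrative BLEU-4 computation with NLTK (assumes nltk is installed).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the green field".split(),
    "a brown dog is running on grass".split(),
]
candidate = "a dog is running across the grass".split()

smooth = SmoothingFunction().method1   # smoothing avoids zero scores for short captions
bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.3f}")
```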
Keywords
Attention mechanism; Convolutional neural network-recurrent neural network; Description generation; Encoder-decoder; Image caption; Transformer
DOI: http://doi.org/10.11591/ijece.v15i3.pp3257-3266
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578
This journal is published by the Institute of Advanced Engineering and Science (IAES).