Description
The paper investigates how the hyperparameters of the "Show and Tell" image captioning model influence the overall efficiency of the method. The method follows an encoder-decoder approach: the encoder, a convolutional neural network (CNN) backbone, extracts image features, and the decoder, a recurrent neural network (RNN), produces a caption, i.e., a phrase describing the image content. In our research, we tested the encoder part by comparing DenseNet, ResNet, and RegNet feature extractors, and the decoder part by varying the size of the RNN. Furthermore, we also investigated the sentence generation stage. The investigation aims to find the optimal combination of feature extractor and decoder size. Our research shows that an optimal choice of the model's hyperparameters increases caption generation efficiency.