Technion - Israel Institute of Technology, Graduate School
M.Sc. Thesis
M.Sc. Student: Rubinstein Meiran
Subject: Automatic Selection of Accurate Image Captions
Department: Industrial Engineering and Management
Supervisor: Assistant Professor Roi Reichart
Full Thesis Text: English Version


Abstract

Understanding the relations between an image and a natural language sentence depicting it (i.e., a caption) is becoming an important task in many applications, such as sentence-based image search, image annotation, early childhood education, and navigation for the blind. Image retrieval (retrieving an image using natural language as a query) is useful for image search, while sentence retrieval (describing the image with natural language sentences) could be used for automated image captioning or for identifying the more accurate of two closely related captions that were originally assigned to the image.
The relationship between an image and a caption can be formalized as a multimodal matching problem, in which related image-caption pairs receive higher matching scores than unrelated ones. This kind of problem requires a deep and detailed understanding of the image's content and of the related captions, including their inter-modal correspondence, which turns out to be highly challenging. Recently, deep neural networks have been employed to better represent images, captions and their relations (e.g., m-RNN and m-CNN), but none of them dealt with the issue of triplets: a single image with two related captions.
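As a rough illustration of this matching view (not the models discussed above), the following sketch scores an image-caption pair with cosine similarity between embedding vectors; the embeddings here are random placeholders standing in for CNN image features and sentence encodings.

    import numpy as np

    def cosine_score(image_vec: np.ndarray, caption_vec: np.ndarray) -> float:
        """Cosine similarity as a simple matching score: related image-caption
        pairs should score higher than unrelated ones."""
        denom = np.linalg.norm(image_vec) * np.linalg.norm(caption_vec)
        return float(image_vec @ caption_vec / denom) if denom else 0.0

    # Toy example: placeholder embeddings; in practice they would come from
    # an image encoder and a sentence encoder.
    rng = np.random.default_rng(0)
    image_vec = rng.normal(size=128)
    caption_a = image_vec + 0.1 * rng.normal(size=128)   # closely related caption
    caption_b = rng.normal(size=128)                      # unrelated caption

    print(cosine_score(image_vec, caption_a))  # high score
    print(cosine_score(image_vec, caption_b))  # low score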
In order to deal with such triplets, we constructed a new dataset that contains 1000 triplets, each consisting of an image and two related captions. Using the CrowdFlower crowd-sourcing platform, each caption within a triplet was given a score, averaged over 7 contributors, for how accurately it depicts the image, on a scale of 1 to 10 (with 1 being "The text doesn't describe the image at all" and 10 being "The sentence provides an appropriate description of the image").
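To make the triplet layout concrete, here is a minimal sketch of one dataset record with the averaged contributor ratings; the field names and values are hypothetical and do not reflect the released file format.

    from dataclasses import dataclass
    from statistics import mean
    from typing import List

    @dataclass
    class Triplet:
        """One entry: an image with two related captions, each rated by several
        contributors on a 1-10 accuracy scale (hypothetical field names)."""
        image_id: str
        caption_1: str
        caption_2: str
        ratings_1: List[int]   # raw 1-10 ratings from the contributors for caption 1
        ratings_2: List[int]   # raw 1-10 ratings from the contributors for caption 2

        def scores(self):
            # Each caption's final score is the average of its contributor ratings.
            return mean(self.ratings_1), mean(self.ratings_2)

    t = Triplet("img_0001", "a dog runs on the beach", "two people sit on a bench",
                ratings_1=[9, 8, 10, 9, 9, 8, 10], ratings_2=[3, 2, 4, 3, 2, 3, 4])
    print(t.scores())  # caption 1's average is higher -> it describes the image better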
We based our dataset on two well-known sentence-based image description datasets: the Flickr30K dataset and the IAPR TC-12 dataset. Since the image and caption vectors live in different feature spaces, we needed to project each image-caption pair into a common space; for this purpose we used Principal Component Analysis (PCA), Deep Canonical Correlation Analysis (DCCA) and a shared-layers architecture (shared representation).
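Purely to illustrate the projection step, the sketch below reduces random placeholder image and caption features to a common dimensionality with scikit-learn's PCA and linear CCA; the DCCA and shared-layers models used in the thesis are neural, so this is only their linear analogue, and the feature dimensions are made up.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    img_feats = rng.normal(size=(500, 512))   # placeholder image features (e.g. CNN activations)
    cap_feats = rng.normal(size=(500, 300))   # placeholder caption features (e.g. averaged word vectors)

    # Option 1: PCA reduces each modality separately to the same dimensionality.
    img_pca = PCA(n_components=16).fit_transform(img_feats)
    cap_pca = PCA(n_components=16).fit_transform(cap_feats)

    # Option 2: linear CCA learns projections that maximally correlate the two
    # views; DCCA replaces these linear maps with deep networks.
    cca = CCA(n_components=16)
    img_cca, cap_cca = cca.fit_transform(img_feats, cap_feats)

    print(img_pca.shape, cap_pca.shape, img_cca.shape, cap_cca.shape)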
Two classifiers (regressors) were then trained for the task of properly identifying the most accurate caption for the image:
The first classifier (a regressor) reviews the triplet and gives each image-caption pair a score (from 1 to 10), based on the CrowdFlower rating results; the pair with the higher score corresponds to the caption that describes the image better (a minimal sketch of this variant follows below).
The second classifier (a regressor) reviews the triplet and directly learns which of the two captions depicts the image better, based on the CrowdFlower rating results.
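As a rough sketch of the first, pair-scoring variant (not the models trained in the thesis), the following uses a plain ridge regressor on random placeholder features assumed to be already projected into the shared space; the feature construction and variable names are assumptions for illustration only.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n_pairs, dim = 2000, 64

    # Placeholder features: each row concatenates an image vector and a caption
    # vector already mapped into the shared space (see the projection sketch above).
    pair_feats = rng.normal(size=(n_pairs, 2 * dim))
    crowd_scores = rng.uniform(1, 10, size=n_pairs)   # averaged CrowdFlower ratings

    # Pair-scoring variant: a regressor predicts the 1-10 accuracy score of a
    # single image-caption pair; within a triplet, the caption whose pair scores
    # higher is taken as the more accurate description.
    scorer = Ridge(alpha=1.0).fit(pair_feats, crowd_scores)

    def pick_caption(img_cap_1: np.ndarray, img_cap_2: np.ndarray) -> int:
        """Return 1 or 2 for whichever caption of the triplet gets the higher
        predicted score."""
        s1, s2 = scorer.predict(np.vstack([img_cap_1, img_cap_2]))
        return 1 if s1 >= s2 else 2

    print(pick_caption(pair_feats[0], pair_feats[1]))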
We present this dataset and the related classifiers, and intend for them to be of broad interest to researchers in Natural Language Processing (NLP) and Computer Vision (CV). We anticipate that these communities will put the data to use in a broader range of tasks than we can foresee.