Technion - Israel Institute of Technology - Graduate School
M.Sc. Thesis

M.Sc. Student: Hagar Loeub
Subject: A Model Composition Approach to Language and Vision Combination in Natural Language Processing
Department: Department of Industrial Engineering and Management
Supervisor: Assistant Professor Roi Reichart
Full thesis text: English Version


Abstract

Distributional Semantics is a research area that aims to learn the semantic similarity between linguistic items based on their distributional properties in large corpora.

Language technology benefits from knowing the relationships between words in a corpus. For example, a search engine could expand the range of web pages returned for a query by considering additional terms that are close in meaning to those in the original query.


In Distributional Semantic Models (DSMs), each word is represented by a vector in a high-dimensional “semantic space”. These models specify mechanisms for constructing semantic representations from text corpora based on the distributional hypothesis [Harris 1954], which claims that words that appear in similar linguistic contexts are likely to have related meanings. DSMs have been applied very effectively to a variety of semantic tasks [Clark 2015, Mikolov 2013b, Turney and Pantel 2010]. However, comparisons against human semantic judgments suggest that these purely textual models are limited.
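To make the distributional hypothesis concrete, the following is a minimal sketch (not taken from the thesis) of a count-based DSM: it builds co-occurrence vectors from a toy corpus and compares words with cosine similarity. The corpus and window size are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

# Toy corpus; real DSMs are trained on large corpora.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

window = 2  # context window size (an illustrative choice)
counts = defaultdict(lambda: defaultdict(int))

# Count how often each word co-occurs with nearby context words.
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1

vocab = sorted({w for s in corpus for w in s.split()})
index = {w: k for k, w in enumerate(vocab)}

# Each row is a word's co-occurrence vector in the "semantic space".
M = np.zeros((len(vocab), len(vocab)))
for w, ctx in counts.items():
    for c, n in ctx.items():
        M[index[w], index[c]] = n

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# Words occurring in similar contexts ("cat", "dog") get similar vectors.
print(cosine(M[index["cat"]], M[index["dog"]]))
```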

One way to improve these results is to enrich the linguistic vectors with perceptual information. This idea led to the development of Multi-modal Distributional Semantic Models (MDSMs), which represent each word in the vocabulary as a vector of linguistic and visual features. These multi-modal models outperform state-of-the-art text-based approaches on certain evaluation sets. However, this is still a relatively new field of research, and greater improvement is expected from better combinations of lexical and visual information.

The two main approaches to learning the meaning of words from textual and perceptual input are: (1) the Sequential Modeling approach, in which visual and textual representations are constructed separately and then merged; and (2) the Joint Modeling approach, in which a combined representation is learned directly. A minimal sketch of the sequential approach follows.
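The sketch below (an illustration, not the thesis pipeline) shows the simplest sequential fusion motif: each modality's vectors are built separately, L2-normalized, and concatenated. The dimensions and the optional SVD step are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pre-trained representations of the same vocabulary:
# rows are words, columns are features (dimensions are illustrative).
text_vecs = rng.normal(size=(1000, 300))   # e.g., word2vec-style vectors
image_vecs = rng.normal(size=(1000, 128))  # e.g., CNN image features

def l2_normalize(X):
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

# Sequential fusion: construct each modality separately, then merge.
fused = np.hstack([l2_normalize(text_vecs), l2_normalize(image_vecs)])

# An optional follow-up motif: reduce the fused space with truncated SVD.
U, S, Vt = np.linalg.svd(fused, full_matrices=False)
fused_reduced = U[:, :100] * S[:100]
print(fused_reduced.shape)  # (1000, 100)
```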

In this thesis we propose a new model for language and vision combination that follows the first approach: we perform a systematic search in the space of modeling motifs in order to find an optimal sequence of models.
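As an illustration of what such a search might look like, here is a hedged sketch of a greedy search over a space of composition motifs. The motif set, the scoring function, and the greedy strategy are hypothetical stand-ins, not the thesis algorithm.

```python
import numpy as np

def svd_motif(X, k=100):
    # Truncated SVD as one candidate motif.
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]

def normalize_motif(X):
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

def center_motif(X):
    return X - X.mean(axis=0)

MOTIFS = {"svd": svd_motif, "normalize": normalize_motif, "center": center_motif}

def greedy_motif_search(X, score, max_steps=3):
    """Greedily append the motif that most improves a benchmark score.

    `score` is a stand-in for evaluation on a semantic benchmark
    (e.g., correlation with human similarity judgments).
    """
    sequence, best = [], score(X)
    for _ in range(max_steps):
        candidates = {name: f(X) for name, f in MOTIFS.items()}
        name, Xc = max(candidates.items(), key=lambda kv: score(kv[1]))
        if score(Xc) <= best:
            break  # no motif improves the score; stop
        X, best = Xc, score(Xc)
        sequence.append(name)
    return sequence, X

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(200, 120))
    toy_score = lambda X: -np.linalg.norm(X)  # placeholder scorer, not a real benchmark
    print(greedy_motif_search(X, toy_score)[0])
```

With a real scoring function, such as Spearman correlation against human similarity judgments, the returned sequence would be the learned composition of models.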

Our sequential models outperform recent alternative multi-modal models, including those that rely on joint representation learning, on five standard semantic benchmarks.

We also present the Residual CCA (R-CCA) method, which complements the standard CCA method by representing, for each modality, the difference between the original signal and its projection onto the shared, maximally correlated space.
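A minimal sketch of this idea, assuming scikit-learn's CCA and a least-squares reconstruction back from the shared space (the reconstruction choice, dimensions, and the final concatenated representation are assumptions, not necessarily the thesis implementation):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))  # textual vectors for a shared vocabulary
Y = rng.normal(size=(1000, 128))  # visual vectors for the same vocabulary

# Project both modalities onto a shared, maximally correlated space.
cca = CCA(n_components=50)
Xc, Yc = cca.fit_transform(X, Y)

def residual(original, projected):
    # Reconstruct the original signal from the shared-space projection
    # by least squares, then keep what the shared space does not explain.
    P, *_ = np.linalg.lstsq(projected, original, rcond=None)
    return original - projected @ P

X_res = residual(X, Xc)  # textual residual
Y_res = residual(Y, Yc)  # visual residual

# One plausible R-CCA representation: shared projections plus residuals.
representation = np.hstack([Xc, Yc, X_res, Y_res])
print(representation.shape)
```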

For two of the five standard semantic benchmarks, the R-CCA method is part of the best configuration that our algorithm yields.