|M.Sc Student||Rotman Guy|
|Subject||A Deep Multimodal Multilingual Approach for Text Processing via Deep Partial Canonical|
|Department||Department of Industrial Engineering and Management||Supervisor||Professor Roi Reichart|
Research in multi-modal semantics deals with the grounding problem, motivated by evidence that many semantic concepts, irrespective of the actual language, are grounded in the perceptual system. In particular, recent studies have shown that performance on natural language processing tasks can be improved by joint modeling of text and vision, with multi-modal and perceptually enhanced representation learning outperforming purely textual representations. The ability to model various modalities jointly stems mainly from the emergence of vector space models, which represent inputs as real-valued vectors. Once different modalities, such as images and text, are represented as such vectors, they can live in a single shared space.
While the main focus is still on monolingual settings, the fact that visual data can serve as a natural bridge between languages has sparked additional interest in multilingual multi-modal modeling. Such models induce bilingual multi-modal spaces based on multi-view learning.
In this work, we propose a novel effective approach for learning bilingual text embeddings conditioned on shared visual information. This additional perceptual modality bridges the gap between languages and reveals latent connections between concepts in the multilingual setup. The shared visual information in our work takes the form of images with word-level tags or sentence-level captions assigned in more than one language.
We propose a deep neural architecture, referred to as Deep Partial Canonical Correlation Analysis (DPCCA). Our model is based on the Partial CCA (PCCA) method, which, to the best of our knowledge, has not been used in multilingual settings before. In short, PCCA is a variant of CCA that learns maximally correlated linear projections of two views conditioned on a shared third view. In our work, we discuss the PCCA and DPCCA methods and show how they can be applied to textual tasks without access to the shared images at test time.
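To make the idea of conditioning on a third view concrete, here is a minimal NumPy sketch of linear partial CCA. It relies on the standard formulation in which both views are first regressed on the conditioning view and ordinary CCA is then run on the residuals. The function names (`inv_sqrt`, `residualize`, `partial_cca`), the ridge term `reg`, and the choice of `x`, `y` as text views and `z` as the visual view are assumptions made for illustration, not the implementation used in the thesis.

```python
import numpy as np


def inv_sqrt(matrix, eps=1e-12):
    """Inverse square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    # Scale each eigenvector column by 1/sqrt(eigenvalue), then rotate back.
    return (eigvecs / np.sqrt(np.maximum(eigvals, eps))) @ eigvecs.T


def residualize(view, cond):
    """Remove the part of `view` that is linearly predictable from `cond`."""
    coef, *_ = np.linalg.lstsq(cond, view, rcond=None)
    return view - cond @ coef


def partial_cca(x, y, z, k, reg=1e-6):
    """Linear partial CCA of views x and y conditioned on view z.

    x: (n, dx), y: (n, dy), z: (n, dz); rows are aligned samples.
    Returns projections a (dx, k), b (dy, k) and the top-k partial
    canonical correlations.
    """
    # Center every view so that cross-products are covariances.
    x, y, z = (v - v.mean(axis=0) for v in (x, y, z))

    # Condition on z by keeping only the regression residuals.
    rx, ry = residualize(x, z), residualize(y, z)

    n = x.shape[0]
    # Covariances of the residuals; `reg` is an assumed small ridge term
    # that keeps the auto-covariances invertible.
    sxx = rx.T @ rx / (n - 1) + reg * np.eye(x.shape[1])
    syy = ry.T @ ry / (n - 1) + reg * np.eye(y.shape[1])
    sxy = rx.T @ ry / (n - 1)

    # Whiten both residual views and take the SVD of the cross-covariance.
    sxx_is, syy_is = inv_sqrt(sxx), inv_sqrt(syy)
    u, s, vt = np.linalg.svd(sxx_is @ sxy @ syy_is)

    return sxx_is @ u[:, :k], syy_is @ vt[:k].T, s[:k]
```

In this sketch the conditioning view `z` is needed only while fitting. The returned matrices act on the two text views alone, which is in the spirit of applying the model to textual tasks when no images are available at test time.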
PCCA inherits one disadvantageous property from CCA: both methods compute estimates of covariance matrices based on all training data. This makes training their deep non-linear variants infeasible in practice, since deep neural networks (DNNs) are predominantly optimized via stochastic optimization algorithms. To resolve this major hindrance, we propose an effective optimization algorithm for DPCCA, inspired by previous work on Deep CCA (DCCA) optimization.
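The tension between full-data covariance estimates and stochastic training can be illustrated with a small sketch. One generic way to reconcile the two is to keep a running estimate of each covariance matrix and refresh it with every mini-batch. The class below does this with a bias-corrected exponential moving average. It is an assumed illustration of the general idea only: the class name `RunningCovariance`, the `momentum` parameter, and the update rule are hypothetical and are not taken from the DPCCA optimization algorithm proposed in the thesis.

```python
import numpy as np


class RunningCovariance:
    """Running estimate of the cross-covariance E[a^T b] from mini-batches.

    Assumes both inputs are already zero-mean (for example, centered
    network outputs). The update rule is an illustrative choice.
    """

    def __init__(self, dim_a, dim_b, momentum=0.99):
        self.momentum = momentum
        self.value = np.zeros((dim_a, dim_b))
        self.steps = 0

    def update(self, a, b):
        """Fold one mini-batch (rows are samples) into the estimate."""
        batch_cov = a.T @ b / a.shape[0]
        self.value = self.momentum * self.value + (1.0 - self.momentum) * batch_cov
        self.steps += 1
        # Bias correction: the average starts at zero, so early estimates
        # would otherwise be shrunk towards zero.
        return self.value / (1.0 - self.momentum ** self.steps)
```

Each call to `update` touches only the current mini-batch, so the cost per step does not grow with the size of the training set, while the returned matrix still aggregates information from many earlier batches.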
We evaluate our DPCCA architecture on two semantic tasks: 1) multilingual word similarity and 2) cross-lingual image caption retrieval. The results reveal consistent improvements over a large space of non-deep and deep CCA-style baselines in both tasks. Most importantly, 1) PCCA is overall better than other methods that do not use the additional perceptual view; 2) DPCCA outperforms PCCA, indicating the importance of nonlinear transformations modeled through DNNs; 3) DPCCA outperforms DCCA, again verifying the importance of conditioning multilingual text embedding induction on the shared visual view; and 4) DPCCA outperforms two recent multi-modal bilingual models that leverage visual information.