Technion - Israel Institute of Technology, Graduate School
Ph.D. Thesis
Ph.D. Student: Leviant Ira
Subject: Multilingual Word Embeddings: Evaluation and Template-Based Algorithms
Department: Department of Industrial Engineering and Management
Supervisor: Associate Prof. Roi Reichart
Full thesis text: English version


Abstract

In recent years, the interest of the Natural Language Processing (NLP) community has been drawn to the development of Vector Space Models (VSMs) of semantics. These models map lexical units such as words, phrases or sentences to vectors, allowing NLP algorithms to compute semantic distances between these units. Most VSMs are based on the distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings.
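As an illustration of how a semantic distance between two word vectors can be computed, the following minimal sketch uses cosine similarity, a common choice for VSMs. The abstract does not name a specific measure, so cosine similarity is an assumption here, and the three-dimensional vectors in `toy_vectors` are made up for illustration and are not taken from the thesis.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (illustrative values only).
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["car"], toy_vectors["automobile"])
sim_unrelated = cosine_similarity(toy_vectors["car"], toy_vectors["banana"])
print(f"car/automobile: {sim_related:.3f}")
print(f"car/banana:     {sim_unrelated:.3f}")
```

With these toy values the related pair scores higher than the unrelated pair, which is the behavior a VSM is expected to show for words that occur in similar contexts.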

Our focus in this thesis is on multilingual word meaning representations. A word meaning representation (generally called a word embedding) is a mathematical object associated with each word, often a vector. Multilingual word representations either map between the word embedding spaces of different languages or place all languages in a common word embedding space, yielding a shared semantic space that reveals word correspondences across languages.

Humans as well as VSMs may consider various languages when making their judgments and predictions. The resulting models are evaluated either intrinsically, against human judgments, where human scores are most often produced for word pairs presented to the evaluators in English, or in application-based evaluation on tasks such as cross-lingual text mining, document classification and sentiment analysis. In this thesis we focus on human-based evaluation, in which the correlation between the model scores and the human scores is computed.
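The human-based evaluation described above can be sketched as follows. The abstract does not say which correlation coefficient is used, so this example assumes Spearman's rank correlation, a common choice for word similarity benchmarks. The function names and the scores in `model_scores` and `human_scores` are hypothetical and serve only to show the mechanics.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(model_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must cover the same word pairs")
    return pearson(average_ranks(model_scores), average_ranks(human_scores))


# Hypothetical scores for five word pairs (illustrative values only).
model_scores = [0.91, 0.15, 0.62, 0.40, 0.77]  # e.g. cosine similarities
human_scores = [9.2, 1.5, 5.0, 6.1, 8.0]       # e.g. mean ratings on a 0-10 scale

rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A rank-based coefficient compares only the ordering of the word pairs, so the model scores and the human ratings do not need to share a scale.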

We show significant differences in human-based evaluation across languages and establish the importance of the judgment language (JL), the language in which word pairs are presented to human evaluators, for human semantic judgments and for their correlation with VSM predictions.

Next, we introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering datasets for 12 typologically diverse languages. 

Finally, we present SG-IWE, a fully unsupervised algorithm for the extraction of patterns, which is suitable for capturing word similarity and is easily adjustable to a multilingual setup.