Technion - Israel Institute of Technology
The Graduate School
M.Sc. Thesis

M.Sc. Student: Ringel Dor
Subject: Cross-Cultural Transfer Learning for Text Classification
Department: Computer Science
Supervisors: Professor Shaul Markovitch, Dr. Kira Radinsky
Full thesis text: English version


Abstract

Text classification is considered one of the key tasks in natural language understanding. A computerized system capable of performing such classification allows computers to identify and classify written texts, and in this sense provides a capability that has so far been typical of humans only. Beyond the research significance of gaining such a capability, the success of these systems has widespread social, economic, and business implications, as well as many applications in today's digital and global world. Examples of such computerized text classification systems include systems that can identify the key topics expressed in a text, recognize whether a text is written in a way that may be perceived as offensive, or classify a text as formal or informal.


The most prominent results in natural language text classification in recent years have been achieved by employing supervised machine learning algorithms.

These algorithms use large labeled training datasets to learn a model. Once training is completed, the model is expected to generalize beyond the inputs on which it was trained, and thus to allow inference on additional inputs that were not part of the training dataset.
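
A minimal, generic illustration of this supervised setting is sketched below; the data, labels, and model are toy placeholders chosen for brevity, not anything taken from the thesis.

```python
# Generic supervised text classification sketch: learn from labeled examples,
# then predict on inputs that were not part of the training set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Labeled training data: each text comes with a human-provided label.
train_texts = [
    "great product, highly recommend",
    "terrible, broke after one day",
    "excellent quality and fast shipping",
    "awful experience, do not buy",
]
train_labels = ["positive", "negative", "positive", "negative"]

# Learn a model from the labeled examples.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# After training, the model is applied to an input it has never seen.
print(model.predict(["fast shipping and great quality"]))
```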


The acquisition process for these labeled datasets is labor-intensive, expensive, and time-consuming. It is also prone to human errors, which degrade the quality of both the dataset itself and the models trained on it.


In this work, we show that cross-cultural differences can be harnessed for natural language text classification.

We present a transfer-learning framework that leverages widely available unaligned bilingual corpora for text classification tasks, using no task-specific data.

Our empirical evaluation on two tasks, formality classification and sarcasm detection, shows that the cross-cultural difference between German and American English, as manifested in product-review text, yields good performance on formality classification, and that the difference between Japanese and American English yields good performance on sarcasm detection, in both cases without any task-specific labeled data.
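
To make the idea concrete, the sketch below shows one way such cross-cultural transfer could be instantiated: a classifier is trained only to distinguish the culture of origin of review text (for example, German reviews machine-translated into English versus native American-English reviews), and its output is then read as a proxy formality score. The corpora, preprocessing, and model here are illustrative placeholders, not the exact pipeline developed in the thesis.

```python
# Illustrative sketch only: a culture-of-origin classifier reused as a
# zero-shot formality proxy. All data below is made up for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical unaligned bilingual review corpora. In practice these would be
# large collections of product reviews: German reviews machine-translated into
# English, and native American-English reviews. No formality labels are used.
german_origin = [
    "The delivery arrived punctually and the device functions flawlessly.",
    "I would kindly request a replacement, as the packaging was damaged.",
]
american_origin = [
    "super fast shipping, works like a charm!!",
    "meh, kinda broke after a week, not a fan",
]

texts = german_origin + american_origin
# Proxy labels: 1 = German origin (assumed more formal register),
#               0 = American origin (assumed more informal register).
origin = [1] * len(german_origin) + [0] * len(american_origin)

# Train a simple bag-of-words classifier to predict culture of origin.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, origin)

# At inference time, the origin probability is read as a formality score,
# even though no formality-labeled example was ever seen.
for sentence in ["I would be most grateful for your assistance.",
                 "lol thanks a bunch dude"]:
    p_formal = clf.predict_proba([sentence])[0, 1]
    print(f"{p_formal:.2f}  {sentence}")
```

The same recipe could, under the analogous assumption, be repeated with Japanese-origin and American-origin corpora as a proxy for sarcasm detection.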