Technion - Israel Institute of Technology
Graduate School
M.Sc Thesis
M.Sc Student: Jacobs Kayla
Subject: Hebrew Acronym: Identification, Expansion and Disambiguation
Department: Department of Computer Science
Supervisors: Professor Emeritus Alon Itai
             Dr. Shalom Wintner
Full thesis text: English version


Abstract

Acronyms are words formed from the initial letters of a phrase. For example, CIA is a well-known acronym for the Central Intelligence Agency, though in other contexts it could mean the Culinary Institute of America or Rome's Ciampino Airport. Understanding acronyms is important for many natural language processing applications, including search and machine translation.

While hand-crafted acronym dictionaries exist, they are limited and require frequent updates. We developed a new machine learning method to automatically build a Modern Hebrew acronym dictionary from unstructured text documents. This is the first such technique, in any language, to specifically include acronyms whose expansions do not necessarily appear in the same documents. We also enhanced the dictionary with contextual information to help select the expansions most appropriate for a given acronym in context. When applied to acronym disambiguation, our dictionary achieved better results than dictionaries built using prior techniques.
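
To make the idea of a context-enhanced acronym dictionary concrete, the following is a minimal sketch of dictionary-based disambiguation using the CIA example from above. It is not the method developed in the thesis: the dictionary contents, the context-word weights, the function name `disambiguate`, and the simple word-overlap scoring are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy dictionary: acronym -> {expansion: context-word weights}.
# In practice these profiles would be learned from text; here they are made up.
ACRONYM_DICT = {
    "CIA": {
        "Central Intelligence Agency": Counter(
            {"intelligence": 3, "spy": 2, "agent": 2, "government": 1}
        ),
        "Culinary Institute of America": Counter(
            {"chef": 3, "cooking": 2, "school": 1, "restaurant": 1}
        ),
        "Ciampino Airport": Counter(
            {"flight": 3, "rome": 2, "airport": 2, "airline": 1}
        ),
    }
}


def disambiguate(acronym, context, dictionary=ACRONYM_DICT):
    """Return the expansion whose context profile best overlaps the context.

    Returns None if the acronym is unknown or no context word matches.
    """
    candidates = dictionary.get(acronym)
    if not candidates:
        return None

    # Crude tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = [t.strip(".,;:!?").lower() for t in context.split()]

    def score(profile):
        # A Counter returns 0 for words it has never seen.
        return sum(profile[t] for t in tokens)

    best = max(candidates, key=lambda expansion: score(candidates[expansion]))
    return best if score(candidates[best]) > 0 else None


if __name__ == "__main__":
    print(disambiguate("CIA", "The chef trained at the CIA before opening a restaurant."))
    print(disambiguate("CIA", "Our flight to Rome landed at CIA."))
```

In this sketch an acronym with no matching context words yields `None` rather than a guess; a real system would more plausibly fall back to the most frequent expansion.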

Additionally, while acronyms have a long history in Hebrew and have previously been investigated from a linguistic perspective, they have never before been studied quantitatively. We discovered new statistically based linguistic insights about acronym usage in Modern Hebrew texts, which should interest both Hebrew language aficionados and developers of Hebrew natural language processing systems.