Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Bar-Haim Roy
Subject: Part-of-Speech Tagging for Hebrew and other Semitic Languages
Department: Department of Computer Science
Supervisor: Mr. Yoad Winter


Abstract

Morphological ambiguity is very common in Modern Hebrew, mainly because most written texts are unvocalized. Reducing this ambiguity is crucial for almost every natural language application in Hebrew. This thesis deals with part-of-speech (POS) tagging for Hebrew and other Semitic languages using Hidden Markov Models (HMMs). Most previous work on HMM tagging was done for languages such as English, in which the basic syntactic units are words. By contrast, words in written Semitic texts often consist of multiple morphemes, each with its own POS tag. Since a word may have more than one possible segmentation into morphemes, an HMM tagger for Semitic languages must also choose the right segmentation for each word. We compare word-level tagging, which uses complex tags that encode both tagging and segmentation, with morpheme-level tagging, in which the input to the HMM tagger is multiple morpheme sequences. We show that morpheme-level tagging outperforms word-level tagging. Enriching the morphemes with information about word boundaries improves segmentation accuracy but degrades tagging accuracy. Therefore, the configuration we propose performs two phases of morpheme-level tagging: the first phase uses word boundaries and determines the segmentation, and the second phase ignores word boundaries and tags the resulting morpheme sequence. We achieve per-word accuracy of 90.42% for tagging and 96.61% for segmentation. Our system outperforms previous taggers for Hebrew, and the results are comparable with state-of-the-art taggers and word segmenters for Arabic, which use much larger annotated corpora.
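
To make the idea of morpheme-level tagging with ambiguous segmentation concrete, the following is a minimal sketch and not the model used in the thesis. It assumes a first-order (bigram) HMM over morpheme tags and a precomputed list of candidate analyses per word. The tag names, the transliterated toy words, and all probabilities are hypothetical values chosen for illustration. The sketch does not implement the two-phase configuration or the word-boundary enrichment, and it compares candidate paths with different numbers of morphemes directly, which is a simplification.

```python
import math

def log_p(table, key, floor=1e-6):
    """Return the log of a probability; unseen events get a small floor (assumption)."""
    return math.log(table.get(key, floor))

def score_analysis(prev_tag, analysis, trans, emit):
    """Log-score of one candidate analysis, a list of (morpheme, tag) pairs."""
    score = 0.0
    for morpheme, tag in analysis:
        score += log_p(trans, (prev_tag, tag)) + log_p(emit, (tag, morpheme))
        prev_tag = tag
    return score, prev_tag

def best_path(candidates, trans, emit, start="<s>"):
    """Viterbi over words; the state is the last tag of the chosen analysis."""
    # best maps a last tag to (log score, analyses chosen so far)
    best = {start: (0.0, [])}
    for word_analyses in candidates:
        new_best = {}
        for prev_tag, (prev_score, path) in best.items():
            for analysis in word_analyses:
                delta, last_tag = score_analysis(prev_tag, analysis, trans, emit)
                total = prev_score + delta
                if last_tag not in new_best or total > new_best[last_tag][0]:
                    new_best[last_tag] = (total, path + [analysis])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Hypothetical toy input: each word has one or more candidate segmentations.
candidates = [
    [[("h", "DET"), ("ild", "NOUN")]],
    [[("smn", "ADJ")], [("s", "REL"), ("mn", "PREP")], [("smn", "NOUN")]],
]
# Made-up transition and emission probabilities, for illustration only.
trans = {("<s>", "DET"): 0.3, ("DET", "NOUN"): 0.8, ("NOUN", "ADJ"): 0.3,
         ("NOUN", "REL"): 0.1, ("REL", "PREP"): 0.2, ("NOUN", "NOUN"): 0.1}
emit = {("DET", "h"): 0.9, ("NOUN", "ild"): 0.1, ("ADJ", "smn"): 0.1,
        ("NOUN", "smn"): 0.1, ("REL", "s"): 0.9, ("PREP", "mn"): 0.5}

result = best_path(candidates, trans, emit)
print(result)
```

With these made-up numbers the decoder keeps the second word unsegmented and tags it as an adjective, so a single search selects both the segmentation and the tags.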