Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Mansour Saeb
Subject: Combining Character and Morpheme Based Models for Part-of-Speech Tagging of Semitic Languages
Department: Department of Computer Science
Supervisors: Professor Emeritus Alon Itai, Mr. Yoad Winter
Full Thesis Text: English Version


Abstract

We propose an enhanced part-of-speech (POS) tagger for Semitic languages that combines word-based and character-based learning to handle Modern Standard Arabic (henceforth Arabic) and Modern Hebrew (henceforth Hebrew) with the same probabilistic model and architecture. We start by porting an existing Hidden Markov Model POS tagger for Hebrew to Arabic, replacing the Hebrew morphological analyzer with Buckwalter's (2002) morphological analyzer for Arabic. This gives state-of-the-art accuracy (97.05%), comparable to Habash and Rambow's (2005) analyzer-based POS tagger on the same Arabic datasets. However, further improvement of such analyzer-based tagging methods is hindered by the incomplete coverage of standard morphological analyzers (Bar-Haim et al., 2007). To overcome this coverage problem, we supplement the output of Buckwalter's analyzer with synthetically constructed analyses proposed by a model that uses character information (Diab et al., 2004), in a way similar to Nakagawa's (2004) system for Chinese and Japanese. A version of this extended model that, unlike Nakagawa's, also incorporates synthetically constructed analyses for known words achieves 97.13% accuracy on the standard Arabic test set, a relative error reduction of about 3%. Repeating the experiments on the Hebrew data improves tagging accuracy from 94.73% to 94.81%, a relative error reduction of about 2%. An online version of the tagger, including implementation details and various unreported results, can be found at http://www.mila.cs.technion.ac.il/Arabic.
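The error-reduction figures quoted in the abstract are relative to the baseline error rate, not absolute accuracy gains. The short sketch below recomputes them from the four accuracies reported above. The function name `relative_error_reduction` is a hypothetical helper introduced only for this illustration and is not part of the tagger.

```python
def relative_error_reduction(baseline_acc: float, improved_acc: float) -> float:
    """Return the relative error reduction, in percent.

    Both arguments are accuracies in percent (e.g. 97.05).
    Hypothetical helper for illustration; not taken from the thesis code.
    """
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Accuracies as reported in the abstract.
arabic = relative_error_reduction(97.05, 97.13)
hebrew = relative_error_reduction(94.73, 94.81)

print(f"Arabic: {arabic:.1f}% relative error reduction")
print(f"Hebrew: {hebrew:.1f}% relative error reduction")
```

Under this definition the Arabic figure comes out at roughly 2.7% and the Hebrew figure at roughly 1.5%, which round to the 3% and 2% stated in the abstract.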