Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Fadida Hanna
Subject: Automatic Extraction of Subcategorization Frames for Hebrew
Department: Department of Computer Science
Supervisors: Professor Emeritus Alon Itai, Dr. Shalom Wintner
Full thesis text: in Hebrew


Abstract

This work automatically constructs from large text corpora the first subcategorization dictionary of Modern Hebrew. Using available resources, the corpora were morphologically analyzed and syntactically parsed. Standard collocation measures were employed to assess the degree to which potential complements tend to combine with each verb, focusing on a small set of potential complement types, which covers the vast majority of complement instances in the corpora. No attempt was made to construct full subcategorization frames; rather, each complement type was viewed in isolation, and its likelihood to combine with the verb was determined. The result is a wide-coverage dictionary of almost 3,000 verb lemmas, listing more than 6,500 verb-complement pairs, each with a statistically-derived score. The quality of the dictionary was evaluated both intrinsically and extrinsically. First, a small set of representative verbs and their canonical complements was manually constructed. The automatically-extracted dictionary achieved high precision and recall on this test set. Second, linguistic knowledge pertaining to verb subcategorization frames was incorporated in two computational tasks: reducing the ambiguity of PP-attachment and translating from Arabic to Hebrew. This demonstrated that knowledge derived from the dictionary is instrumental in significantly improving the accuracy of these two tasks. The contribution of this work is thus a digital, freely-available, wide-coverage and accurate verb subcategorization dictionary of Hebrew.
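As a rough illustration of how a collocation measure can score verb-complement pairs in isolation, the sketch below computes pointwise mutual information (PMI) over observed (verb lemma, complement type) pairs. The abstract does not say which measures were used, so PMI is an assumption chosen for illustration; the function name `pmi_scores`, the placeholder verbs, and the complement labels are likewise hypothetical, and the counts are toy data rather than figures from the thesis.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Score each (verb, complement type) pair with pointwise mutual information.

    `pairs` is an iterable of (verb_lemma, complement_type) observations,
    e.g. one per verb-complement co-occurrence found in a parsed corpus.
    PMI is an illustrative choice, not necessarily the measure used in the thesis.
    """
    pairs = list(pairs)
    total = len(pairs)
    pair_counts = Counter(pairs)
    verb_counts = Counter(verb for verb, _ in pairs)
    comp_counts = Counter(comp for _, comp in pairs)

    scores = {}
    for (verb, comp), n in pair_counts.items():
        p_joint = n / total
        p_verb = verb_counts[verb] / total
        p_comp = comp_counts[comp] / total
        # Positive PMI: the pair co-occurs more often than chance predicts.
        scores[(verb, comp)] = math.log2(p_joint / (p_verb * p_comp))
    return scores


# Toy observations with hypothetical verbs and complement types.
observations = (
    [("verb_a", "comp_x")] * 8
    + [("verb_a", "comp_y")] * 2
    + [("verb_b", "comp_x")] * 2
    + [("verb_b", "comp_y")] * 8
)
scores = pmi_scores(observations)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

In this toy setting, pairs that co-occur more often than their individual frequencies predict receive a positive score, and a dictionary could keep only the pairs above some threshold. Choosing that threshold, and the measure itself, would need to be validated against a manually built test set such as the one the abstract describes.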