Technion - Israel Institute of Technology
Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Joshua Chaim Wortman
Subject: Film Classification Using Subtitles and Automatically Generated Language Factors
Department: Department of Industrial Engineering and Management
Supervisor: Professor Emeritus Itai Alon
Full Thesis Text: English Version


Abstract

We describe a novel approach to classifying films into genres based only on text extracted from the dialog. Our dataset included subtitles from 1062 films and 14 genre categories; most films belong to several genres. Using a vector model methodology over a set of factors that describe the linguistic content, we achieve 79.4% classification accuracy. This work describes how to create these factors automatically using an unsupervised clustering methodology that groups unigram terms. Compared to classification using the factors defined in the LIWC, a hand-coded dictionary widely used for similar research, classification based on our factors reduced the error in precision and recall by 8.1% and 6.6% respectively. More significantly, we achieved this improvement using only 752 unique words, representing 5.9% of the word content in the subtitles, whereas the LIWC uses 87.5%. Implications and suggestions for future work are discussed.
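
The factor-based vector representation described above can be illustrated with a minimal sketch. It is not the thesis implementation: the factor names, the words in each factor, and the helper functions `factor_vector`, `coverage`, and `cosine` are all hypothetical, and cosine similarity is assumed here as one common way to compare vectors in a vector model. In the thesis the factors are produced by unsupervised clustering of unigram terms, which this sketch does not attempt to reproduce.

```python
from collections import Counter
from math import sqrt

# Hypothetical factors, invented for illustration only. In the thesis the
# factors come from unsupervised clustering of unigram terms.
FACTORS = {
    "violence": {"gun", "kill", "shoot"},
    "romance": {"love", "kiss", "darling"},
}


def factor_vector(tokens, factors):
    """Return, for each factor in order, the share of tokens it accounts for."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [sum(counts[w] for w in words) / total for words in factors.values()]


def coverage(tokens, factors):
    """Return the fraction of tokens that belong to any factor word."""
    if not tokens:
        return 0.0
    vocab = set().union(*factors.values())
    return sum(1 for t in tokens if t in vocab) / len(tokens)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# A toy subtitle fragment, already lowercased and tokenized.
tokens = "i love you darling put the gun down".split()
vec = factor_vector(tokens, FACTORS)   # one value per factor
cov = coverage(tokens, FACTORS)        # share of the dialog the factors use
```

Under these assumptions, a film's vector could be compared against a per-genre reference vector with `cosine`, and because most films belong to several genres, each genre would be scored independently rather than picking a single best match. The `coverage` value is the analogue of the word-content percentage quoted in the abstract.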