|M.Sc Student||Joshua Chaim Wortman|
|Subject||Film Classification Using Subtitles and Automatically Generated Language Factors|
|Department||Department of Industrial Engineering and Management||Supervisor||Professor Emeritus Itai Alon|
We describe a novel approach to classifying films into genres based only on text extracted from the dialog. Our dataset comprised subtitles from 1062 films across 14 genre categories; most films belong to several genres. Using a vector model methodology over a set of factors that describe the linguistic content, we achieve 79.4% classification accuracy. This work describes how to create these factors automatically using an unsupervised clustering methodology that groups unigram terms. Classification based on our factors reduced the error in precision and recall by 8.1% and 6.6%, respectively, compared with classification using the factors defined in the LIWC, a hand-coded dictionary widely used in similar research. More significantly, we achieved this improvement using only 752 unique words, representing 5.9% of the word content in the subtitles, whereas the LIWC uses 87.5%. Implications and suggestions for future work are discussed.
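The factor-based vector model described above can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: the factor word lists in `FACTORS` are invented toy examples (not the clusters produced in the thesis), each film is represented by the fraction of its tokens that fall in each factor, genres are represented by the mean vector of their training films, and cosine similarity with a fixed threshold stands in for whatever similarity measure and decision rule the thesis actually uses. The threshold allows a film to receive several genres, as most films in the dataset do.

```python
import math
from collections import Counter

# Hypothetical factors for illustration only: each factor is a group of
# unigram terms. In the thesis these groups come from unsupervised clustering.
FACTORS = {
    "violence": {"gun", "kill", "shoot"},
    "romance": {"love", "kiss", "heart"},
}


def factor_vector(text, factors=FACTORS):
    """Return, for each factor, the fraction of tokens belonging to it."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [sum(counts[w] for w in words) / total for words in factors.values()]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def predict_genres(text, genre_centroids, threshold=0.5):
    """Assign every genre whose centroid is similar enough to the film vector."""
    vec = factor_vector(text)
    return sorted(g for g, c in genre_centroids.items() if cosine(vec, c) >= threshold)
```

To use the sketch, build one centroid per genre from the factor vectors of that genre's training subtitles, then call `predict_genres` on the subtitle text of an unseen film. The point of the representation is that only words appearing in some factor contribute to the vector, which is consistent with the abstract's observation that a small vocabulary can carry the classification.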