Technion - Israel Institute of Technology
Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Petersil Boaz
Subject: Improving Ranking Performance for a CQA Vertical Search by Modeling Unmatched Terms and Learning a Term Similarity Function
Department: Department of Electrical Engineering
Supervisor: Professor Yacov Crammer


Abstract

The task in Information Retrieval (IR) is to retrieve relevant documents from a corpus
given a query. This is done by calculating a relevance score between the query
and each of the documents in a candidate set. To calculate the relevance score,
many works extract features from the query and the candidate and use Machine Learning
to learn a score function on (sometimes automatically) labeled data. However,
the extracted features mostly consider query terms that are matched to candidate terms,
while unmatched terms in both the query and the candidate are considered only indirectly.

The first contribution of this work is a feature family designed to consider
the unmatched terms directly, motivated by the intuition that more unmatched
information means a less relevant candidate. In order for our features to consider
only the truly unmatched terms (and not synonyms or inflections), our model first
uses distributional similarity to match the query and candidate terms. However,
distributional similarity is obtained without considering the ranking model that
uses it. The second contribution of this work is a pair of learning mechanisms that
train a term similarity function for a given ranking model, using only the ranking task
labels. In addition to being fitted to a specific ranking model, our learned similarity
function can take into account both semantic and string similarity, allowing it
to match terms where a distributional similarity function fails, such as misspelled
terms and unigram/bigram variations. As a test case, we consider vertical search
in Community-based Question Answering (CQA) sites from Web queries. Queries
that result in viewing CQA content often contain fine-grained information needs
and benefit more from considering unmatched terms. Our experiments show that
both our features and our learned similarity function improve retrieval performance
significantly.
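
The idea of matching terms first and then scoring what remains unmatched can be illustrated with a minimal sketch. This is not the feature family or the similarity function from the thesis: the function names, the greedy matching rule, the 0.8 threshold, and the use of a plain string-similarity ratio in place of distributional or learned similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption; not the thesis's learned function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def unmatched_term_features(query, candidate, sim=string_similarity, threshold=0.8):
    """Return illustrative unmatched-term features for one (query, candidate) pair."""
    matched_candidate = set()
    unmatched_query = []
    for q in query:
        # Pick the candidate term most similar to this query term.
        best = max(candidate, key=lambda c: sim(q, c), default=None)
        if best is not None and sim(q, best) >= threshold:
            matched_candidate.add(best)
        else:
            unmatched_query.append(q)
    # Candidate terms that no query term was matched to.
    unmatched_candidate = [c for c in candidate if c not in matched_candidate]
    return {
        "unmatched_query_count": len(unmatched_query),
        "unmatched_query_fraction": len(unmatched_query) / max(len(query), 1),
        "unmatched_candidate_count": len(unmatched_candidate),
        "unmatched_candidate_fraction": len(unmatched_candidate) / max(len(candidate), 1),
    }


# Hypothetical example pair: "flights" matches the inflection "flight",
# while "to" and "hotels" remain unmatched.
query = ["cheap", "flights", "to", "paris"]
candidate = ["cheap", "flight", "paris", "hotels"]
features = unmatched_term_features(query, candidate)
```

In a ranking pipeline, values like these would be passed to the learned score function alongside the usual matched-term features; swapping `sim` for a different similarity function changes which terms count as truly unmatched.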