Technion - Israel Institute of Technology
The Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Magen Aviram
Subject: Find A Cure: Learning to Rank Articles for Molecular Queries
Department: Department of Computer Science
Supervisor: Dr. Kira Radinsky


Abstract

The cost of developing new drugs is estimated at billions of dollars per year. Identification of new molecules for drugs involves scanning existing bio-medical literature for relevant information. As the potential drug molecule is novel, a simple direct search is unlikely to retrieve relevant information. Identifying relevant papers is therefore a more complex and challenging task, which requires searching for information on molecules with characteristics similar to those of the novel drug.

In our research, we present the novel task of ranking documents based on novel molecule queries. Given a chemical molecular structure, we wish to rank medical papers that will contribute to a researcher's understanding of the novel molecule's drug potential.

We present a set of ranking algorithms and molecular embeddings to address the task. We extensively evaluate the algorithms over the molecular embeddings, studying their performance on a benchmark retrieval corpus.

Additionally, we introduce a heterogeneous edge-labeled graph embedding approach to address the molecule ranking task. Our evaluation shows that the proposed embedding model can significantly improve molecule ranking methods.

One of the first steps of the drug discovery process is the generation of candidate molecular compounds. During this process, prior trials and publications regarding similar substances are reviewed in order to ensure the novelty of the compound and evaluate its characteristics. In this work, we present an algorithmic similarity-search framework for drug discovery that can assist in this process. Given a novel chemical molecular structure, which has not been previously developed, we wish to identify and rank the medical papers most relevant to the molecule.

The first step of implementing a ranking system is selecting a proper representation, or embedding, of the query, which in our case is a molecule. The embedding can then be used as input to a supervised learning-to-rank classifier. We study both structural embedding, which relates to the chemical structure of the molecule, and linguistic embedding, which is based on the contexts of the molecule in a large PubMed corpus. We present a novel methodology to combine the two approaches by constructing a graph G with several types of nodes, representing text documents, structural molecular fingerprints, and molecules. The graph's edges represent the similarity between the different nodes. We leverage a graph embedding algorithm to produce node representations using random walks. This enables us to jointly learn representations of both documents and molecules.
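To make the graph construction above more concrete, the following is a minimal sketch, not the thesis implementation. It builds a toy graph with the three node types named above and samples weighted random walks over it. The class `HeteroGraph`, the edge labels (`has_fingerprint`, `mentions`, `similar_structure`), the toy data, and the use of Tanimoto similarity between fingerprint bit sets as the molecule-to-molecule edge weight are all assumptions made for illustration; the abstract does not specify them.

```python
import random
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class HeteroGraph:
    """Hypothetical undirected graph with typed nodes and labeled, weighted edges."""

    def __init__(self):
        self.node_type = {}           # node name -> "molecule" | "fingerprint" | "document"
        self.adj = defaultdict(list)  # node name -> [(neighbor, edge_label, weight)]

    def add_node(self, name, kind):
        self.node_type[name] = kind

    def add_edge(self, u, v, label, weight):
        self.adj[u].append((v, label, weight))
        self.adj[v].append((u, label, weight))

    def random_walk(self, start, length, rng):
        """Sample a walk of up to `length` nodes, choosing neighbors by edge weight."""
        walk = [start]
        while len(walk) < length:
            neighbors = self.adj[walk[-1]]
            if not neighbors:
                break
            nodes = [n for n, _, _ in neighbors]
            weights = [w for _, _, w in neighbors]
            walk.append(rng.choices(nodes, weights=weights, k=1)[0])
        return walk


def build_toy_graph():
    # Toy data (assumed): fingerprint bits per molecule, molecules mentioned per document.
    fingerprints = {"mol_A": {1, 2, 3}, "mol_B": {2, 3, 4}}
    mentions = {"doc_1": ["mol_A"], "doc_2": ["mol_A", "mol_B"]}

    g = HeteroGraph()
    for mol, bits in fingerprints.items():
        g.add_node(mol, "molecule")
        for bit in bits:
            g.add_node(f"fp_{bit}", "fingerprint")
            g.add_edge(mol, f"fp_{bit}", "has_fingerprint", 1.0)
    for doc, mols in mentions.items():
        g.add_node(doc, "document")
        for mol in mols:
            g.add_edge(doc, mol, "mentions", 1.0)

    # Structural similarity edge between the two molecules (assumed weighting).
    sim = tanimoto(fingerprints["mol_A"], fingerprints["mol_B"])
    if sim > 0:
        g.add_edge("mol_A", "mol_B", "similar_structure", sim)
    return g


rng = random.Random(0)
graph = build_toy_graph()
# Three walks of five nodes from every node in the graph.
walks = [graph.random_walk(node, 5, rng) for node in graph.node_type for _ in range(3)]
```

In a pipeline of this kind, the sampled walks would typically be treated as sentences and passed to a skip-gram style model, so that documents and molecules that co-occur on walks receive nearby vectors. That downstream step is an assumption about one common way to embed nodes from random walks, not a description of the specific algorithm used in the thesis.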