Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Harel Shahar
Subject: Prototype-Based Chemical Design using Diversity-Driven Generative Models
Department: Department of Computer Science
Supervisors: Professor Shaul Markovitch, Dr. Kira Radinsky


Abstract

As the space of potential molecules for pharmacological treatment is effectively infinite, designing a new drug is an expensive and lengthy process. A common technique in drug discovery is to start from a molecule that already has some of the desired properties, which we call the prototype. An interdisciplinary team of scientists then generates hypotheses about the changes required to the prototype. We call this process prototype-driven hypothesis generation.


In this work, we develop an unsupervised algorithmic approach for prototype-driven hypothesis generation. Our method is inspired by the known analogy between a chemist's understanding of a compound and a language speaker's understanding of a word (“Atoms are letters, molecules are the words, supramolecular entities are the sentences and the chapters” [Jean-Marie Lehn 1995]), which suggests the potential of Natural Language Processing methods for Computational Chemistry. More formally, we design a conditional deep generative model for molecule generation with diversity attention.

The model operates on a given prototype molecule and generates a variety of candidate molecules. The generated molecules should be novel and share desired properties with the prototype. Our model extends Variational Autoencoders to allow conditional diverse sampling: sampling an example from the data distribution (drug-like molecules) that is close to a given input. This allows sampling molecules close to a prototype molecule, and thus increases the probability of generating a valid drug with similar characteristics. Additionally, we add a diversity component that introduces parametrized diversity into the generation process, so that sampling can produce molecules that are novel with respect to the prototype.
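
To make the idea of sampling near a prototype with a tunable amount of diversity more concrete, the following is a minimal sketch and not the thesis's actual mechanism. It assumes a hypothetical VAE encoder has already mapped the prototype to a per-dimension latent mean `mu` and standard deviation `sigma`, and it assumes diversity enters as a simple scale on the sampling noise. The function name `sample_around_prototype` and the `diversity` parameter are illustrative only.

```python
import random


def sample_around_prototype(mu, sigma, diversity, n_samples, seed=None):
    """Draw latent vectors near a prototype's latent encoding.

    mu, sigma: per-dimension mean and standard deviation that a
        (hypothetical) VAE encoder would output for the prototype.
    diversity: non-negative scale on the Gaussian noise (an assumption
        of this sketch); 0 returns the prototype's mean unchanged.
    """
    if diversity < 0:
        raise ValueError("diversity must be non-negative")
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")

    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Reparameterization-style draw: z = mu + diversity * sigma * eps,
        # with eps sampled from a standard normal distribution.
        z = [m + diversity * s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        samples.append(z)
    return samples


# Larger diversity values spread the latent samples further from the prototype.
near = sample_around_prototype([0.5, -1.0], [0.1, 0.2], diversity=0.5, n_samples=3, seed=0)
far = sample_around_prototype([0.5, -1.0], [0.1, 0.2], diversity=2.0, n_samples=3, seed=0)
```

In such a setup, each sampled vector would then be passed through the decoder to obtain a candidate molecule; decoding and validity checking are outside the scope of this sketch.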


We show that the molecules generated by the system are valid, are strongly related to the prototype, and are novel at the same time. In addition, we suggest several ranking functions for the population of generated molecules. Among the compounds generated by the system, we identified 35 FDA-approved drugs. For example, our system generated Isoniazid, one of the main drugs for tuberculosis.