M.Sc Thesis

M.Sc Student: Manheim Yanay
Subject: Variational Data Augmentation with Deep Learning Generative Models for Imbalanced Learning
Department: Department of Industrial Engineering and Management
Supervisor: Associate Prof. Tamir Hazan


Deep Learning models are known for their ability to solve Machine Learning problems that involve large amounts of data with evenly distributed classes. In contrast, imbalanced learning problems, where some classes are underrepresented, remain a challenge and often require creative solutions to be solved effectively.

This work focuses on balancing the class distribution by generating examples from the underrepresented class, a technique known as oversampling, using deep generative models based on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Deep generative models are designed to work with high-dimensional (possibly structured) data, as opposed to other methods such as the Synthetic Minority Over-sampling Technique (SMOTE) and the Adaptive Synthetic Sampling Approach (ADASYN).
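For readers unfamiliar with the classical baselines mentioned above, the following is a minimal sketch of the interpolation idea behind SMOTE-style oversampling. It is not the method proposed in this thesis and not a reference implementation of SMOTE; the function name `smote_like_oversample` and its parameters are illustrative assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    sampled minority point and one of its k nearest minority neighbours.
    Illustrative sketch only; requires at least two minority points."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise squared distances within the minority class.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    # Pick a base point and one of its neighbours for every new sample.
    base = rng.integers(0, n, size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    # Interpolate at a random position on the segment between them.
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[nb] - X_min[base])

X_minority = np.random.default_rng(1).normal(size=(10, 2))
synthetic = smote_like_oversample(X_minority, n_new=25)
```

Because every synthetic point lies on a straight segment between two existing points in the input space, this kind of interpolation tends to be less suitable for high-dimensional structured data such as images, which is the setting that motivates deep generative models here.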

This thesis proposes a direct variational generative discriminative network that can generate instances conditioned on additional data, such as a class label. Our method models an instance as a composition of discrete and continuous latent features in a probabilistic model. The discrete and continuous variables are sampled using the Gumbel-max trick and the standard Gaussian reparameterization, respectively. The resulting loss objective still includes an argmax operation, which could be estimated with softmax-based relaxations. Instead, we optimize the discrete loss directly by applying the direct optimization through argmax approach for discrete VAEs. We use the encoding of the data into the discrete component as a classifier, so that once training is done we are left with both a generator and a classifier. By augmenting the data, the classifier is able to learn from a larger, balanced, and more diverse training set.
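The two sampling steps named above can be sketched as follows. This is a minimal NumPy illustration of the Gumbel-max trick and the Gaussian reparameterization in isolation; it does not reproduce the thesis's network or its direct optimization through argmax, and the function names and example values are assumptions.

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    """Draw one-hot samples from softmax(logits) with the Gumbel-max trick:
    add independent Gumbel(0, 1) noise to each logit and take the argmax."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    idx = np.argmax(logits + gumbel, axis=-1)
    return np.eye(logits.shape[-1])[idx]

def gaussian_reparam(mu, log_var, rng):
    """Sample z ~ N(mu, exp(log_var)) as z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
# Discrete latent: class probabilities (0.7, 0.2, 0.1), 20000 draws.
logits = np.tile(np.log([0.7, 0.2, 0.1]), (20000, 1))
z_discrete = gumbel_max_sample(logits, rng)
# Continuous latent: mean 2.0 and standard deviation 0.5, 20000 draws.
z_continuous = gaussian_reparam(np.full(20000, 2.0),
                                np.full(20000, np.log(0.25)), rng)
```

The empirical class frequencies of `z_discrete` should approach the softmax probabilities, and `z_continuous` should have mean close to 2.0 and standard deviation close to 0.5. The argmax in `gumbel_max_sample` is the non-differentiable operation that the paragraph above refers to.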

During training, synthetic data is generated and added to every mini-batch in order to improve classification performance and generalization, which are measured by several evaluation metrics. The effectiveness of our proposed method is demonstrated on two data sets and compared to the latest state-of-the-art deep-learning-based generative-discriminative algorithms.
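One way to picture the per-mini-batch augmentation is sketched below. It assumes that each mini-batch is topped up until all classes have equal counts, which the text above does not specify; `balance_minibatch` and `toy_generate` are hypothetical names, and `toy_generate` is only a stand-in for a trained conditional generator.

```python
import numpy as np

def balance_minibatch(x, y, generate, n_classes):
    """Top up every class in a mini-batch to the size of its largest class.
    `generate(label, n)` is a hypothetical stand-in for a conditional
    generator that returns n synthetic instances of the given class."""
    counts = np.bincount(y, minlength=n_classes)
    target = counts.max()
    xs, ys = [x], [y]
    for c in range(n_classes):
        deficit = int(target - counts[c])
        if deficit > 0:
            xs.append(generate(c, deficit))
            ys.append(np.full(deficit, c, dtype=y.dtype))
    return np.concatenate(xs), np.concatenate(ys)

rng = np.random.default_rng(0)

def toy_generate(label, n):
    # Assumption for illustration: noise around a class-dependent mean.
    return rng.normal(loc=float(label), scale=0.1, size=(n, 2))

# An imbalanced mini-batch: 7 instances of class 0 and 1 instance of class 1.
x_batch = rng.normal(size=(8, 2))
y_batch = np.array([0, 0, 0, 0, 0, 0, 0, 1])
x_aug, y_aug = balance_minibatch(x_batch, y_batch, toy_generate, n_classes=2)
```

After the call, `y_aug` contains 7 instances of each class, and the first 8 rows of `x_aug` are the original mini-batch.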