Masterthesis

Finished

pandas
Jupyter
keras
dimdrop
numpy
seaborn
matplotlib

September 2018 - June 2019

Extended abstract (EN)

Full thesis (NL)

dimdrop

machine learning

dimensionality reduction

master

autoencoder

About

My master thesis which completed my degree in Computer Science. It concerned research into dimensionality reduction and visualization techniques for high-dimensional data using autoencoders. The full text can be found in the online library of Ghent University. It is written in Dutch, but it contains an extended abstract in English.

For my thesis I implemented a lot of models in Python using the keras framework to be used for my experiments. Afterwards I released them as an open source pip package called dimdrop.

I wrote a lot of code for my thesis, so I decided afterwards to release it as an open source Python package.

Summary

Advances in flow cytometry allow to measure increasing amounts of parameters per cell, generating high-dimensional datasets. Visualizing these datasets in their raw form is tedious and confusing. Dimensionality reduction can be used to reduce the amount of dimensions in the datasets. When reducing to two dimensions we can visualize the dataset in one scatter plot.

I researched a lot of different dimensionality reduction methods, some of which were based on Autoencoders. Due to the design of an autoencoder with a middle layer with lower dimension than the outer layers it is possible to use this layer to perform dimensionality reduction. These methods can be divided into two classes.

Performing the reduction generates a model which can then be used to reduce additional data with a lower computational cost. Autoencoders are part of this category as they can be trained on the initial data and then the trained model can be used for other data. Other examples are PCA and Parametric t-SNE (uses a neural network with special loss function).

Performing the reduction just yields the reduction, the reductions are often better at preserving the local structure of the data. Examples are t-SNE and UMAP.

Additionally I researched about metrics to objectively score the reductions produced by these methods. These can also be dividen into two categories.

Scoring how well the neighborhoods are preserved in the data. Done by looking at the nearest (or furthest) neighbors of a data element and scoring how well these are kept in the reduction.

Scoring how well clustering can identify different classes in the data. Done by looking at a clustering result of the reduction and comparing it with the known classes of the data.

Then I made an extensive comparison of the models using different datasets and by scoring them on the metrics I researched.

Dimdrop

Summary

Dimensionality reduction

Parametric

Non Parametric

Evaluation

Neighborhood based

Clustering based

Comparison

Masterthesis

Dimensionality reduction using Autoencoders

Dimdrop

Summary

Dimensionality reduction

Parametric

Non Parametric

Evaluation

Neighborhood based

Clustering based

Comparison

Table of contents