Comparison of machine learning methods for the classification of high-throughput gene expression data


High-throughput technology allows us to measure the expression of thousands of genes in a single assay and to examine the effect of many genes on an organism. This has led to the development of prognostic and predictive tests based on classifying these high-dimensional data with various machine learning methods (for instance, support vector machines, random forests, and linear and nonlinear regression methods). However, the performance of machine learning methods on high-dimensional data is highly problem dependent, so there is no standard methodology for a given problem.

Problem Statement

The main aim of the project is to implement and compare various machine learning methods on high-throughput transcriptomics data (RNA-seq, microarray). Further, we want to identify optimal genetic signatures that distinguish normal from mutated M. marinum (Mycobacterium marinum, which causes opportunistic infections in humans), based on the best-performing machine learning method. The optimal genetic signatures will then be used to reconstruct a gene regulatory network of M. marinum.
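
As a rough illustration of what "comparing methods" could look like in practice, the following minimal sketch evaluates two simple classifiers with k-fold cross-validation on synthetic, high-dimensional data. Everything in it is an assumption made for illustration: the data are simulated rather than real M. marinum measurements, the nearest-centroid and 1-nearest-neighbour classifiers are simple stand-ins for methods such as support vector machines or random forests, and all function names and parameter values are hypothetical. It is written in Python only for compactness; the project itself calls for R or MATLAB.

```python
import random
import statistics


def make_data(n_samples=60, n_genes=200, n_informative=10, shift=1.5, seed=0):
    """Simulate an expression matrix (assumption: Gaussian noise, two classes).

    The first n_informative genes are shifted upwards in class 1;
    all other genes are pure noise.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate classes so both are equally represented
        row = [rng.gauss(shift * label if g < n_informative else 0.0, 1.0)
               for g in range(n_genes)]
        X.append(row)
        y.append(label)
    return X, y


def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest_centroid(train_X, train_y, test_X):
    """Assign each test sample to the class with the closest mean profile."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, t in zip(train_X, train_y) if t == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c]))
            for x in test_X]


def one_nearest_neighbour(train_X, train_y, test_X):
    """Assign each test sample the label of its closest training sample."""
    preds = []
    for x in test_X:
        idx = min(range(len(train_X)), key=lambda i: sq_dist(x, train_X[i]))
        preds.append(train_y[idx])
    return preds


def cross_val_accuracy(classifier, X, y, k=5, seed=0):
    """Estimate accuracy by k-fold cross-validation on shuffled indices."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        preds = classifier([X[i] for i in train],
                           [y[i] for i in train],
                           [X[i] for i in fold])
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)


X, y = make_data()
results = {
    "nearest centroid": cross_val_accuracy(nearest_centroid, X, y),
    "1-nearest neighbour": cross_val_accuracy(one_nearest_neighbour, X, y),
}
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

Using the same folds for every method keeps the comparison paired, so differences in accuracy are less likely to come from a lucky split. In a real analysis, any gene selection step would need to happen inside each training fold to avoid an optimistic bias.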

Required background

Master's students with a background in Bioinformatics, Informatics, Electrical Engineering, Physics or Applied Mathematics.

Programming skills for this project

R or MATLAB


Dr. Nurgazy Sulaimanov