Performance evaluation of different machine learning algorithms in presence of outliers using gene expression data

M Shahjaman; MM Rashid; MI Asifuzzaman; H Akter; SMS Islam; MNH Mollah

doi:10.3329/jbs.v28i0.44712

Authors

M Shahjaman Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
MM Rashid Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
MI Asifuzzaman Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
H Akter Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
SMS Islam Institute of Biological Sciences, University of Rajshahi, Bangladesh
MNH Mollah Bioinformatics Lab., Department of Statistics, University of Rajshahi, Bangladesh

DOI:

https://doi.org/10.3329/jbs.v28i0.44712

Keywords:

Classification, DE gene, GED, Outliers, Robustness

Abstract

Classification of samples into one or more populations is one of the main objectives of gene expression data (GED) analysis. Many machine learning algorithms have been employed in previous studies to perform this task. However, these studies did not consider the problem of outliers. GEDs are often contaminated by outliers because of the several steps involved in the data-generating process, from hybridization of DNA samples to image analysis. Most algorithms produce more false positives and lower accuracy in the presence of outliers, particularly when the number of replicates in the biological conditions is small. Therefore, this paper presents a comprehensive comparison of five popular machine learning algorithms (SVM, RF, Naïve Bayes, k-NN and LDA) using both simulated and real gene expression datasets, in the absence and presence of outliers. Three outlier rates (5%, 10% and 50%) and six performance indices (TPR, FPR, TNR, FNR, FDR and AUC) were used to assess the five algorithms. Results from both the simulated and the real GED analyses showed that SVM performed comparatively better than the other four algorithms (RF, Naïve Bayes, k-NN and LDA) for both small and large sample sizes.
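
The six performance indices named in the abstract have standard definitions, which can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a two-class problem with labels coded 0/1, and the function names `confusion_rates` and `auc` are hypothetical. AUC is computed here by the pairwise (Mann-Whitney) formulation, with ties counted as one half.

```python
import math


def confusion_rates(y_true, y_pred):
    """Return TPR, FPR, TNR, FNR and FDR for binary labels coded 0/1 (assumed coding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # An index is undefined when its denominator is zero.
        return num / den if den else math.nan

    return {
        "TPR": ratio(tp, tp + fn),  # true positive rate (sensitivity)
        "FPR": ratio(fp, fp + tn),  # false positive rate
        "TNR": ratio(tn, fp + tn),  # true negative rate (specificity)
        "FNR": ratio(fn, tp + fn),  # false negative rate
        "FDR": ratio(fp, tp + fp),  # false discovery rate
    }


def auc(y_true, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive sample receives the higher score."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return math.nan
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and predictions, invented purely for illustration.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(confusion_rates(y_true, y_pred))
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Note that TPR and FNR sum to one, as do TNR and FPR, so the six indices carry four independent pieces of information.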

J. bio-sci. 28: 69-80, 2020
