Model Selection Strategies for Cancer Prediction from Gene Expression Data: A Beginner's Perspective on Machine Learning

Authors

  • Bandhan Sarker, Department of Statistics, Faculty of Science, Gopalganj Science and Technology University, Gopalganj, 8105, Bangladesh
  • Md Matiur Rahaman, Department of Statistics, Faculty of Science, Gopalganj Science and Technology University, Gopalganj, 8105, Bangladesh

DOI:

https://doi.org/10.3329/ijss.v25i2.85734

Keywords:

Machine learning (ML), Microarray gene expression, Support vector machine (SVM), Linear discriminant analysis (LDA), Random forest (RF).

Abstract

Microarray gene expression data are often classified by cell line or tumor type to assist in the diagnosis and prediction of human cancer. Although microarray analysis shows promise, accurate cancer diagnosis and prediction depend on choosing a suitable machine learning approach. In this beginner’s guide, we outline how to select a machine learning model for cancer phenotype prediction by comparing existing methods. The study compares three well-known machine learning methods: linear discriminant analysis (LDA), support vector machines (SVM), and random forest (RF). Prediction performance was assessed with several metrics: prediction accuracy (AC), the area under the curve (AUC), the F-measure, the receiver operating characteristic (ROC) curve, and the precision-recall curve (PRC). Microarray gene expression data from two cancer types, leukemia and colon cancer, were analysed. A cross-validation procedure with 100 resampling iterations was used to compute average performance measures (APM), so that the reported results do not depend on a single data split. The findings underline the importance of selecting the right machine learning model for accurate prediction of new samples. All three methods gave satisfactory results on both the leukemia and colon cancer datasets, as measured by the APM. Notably, the random forest classifier performed best in cancer prediction.
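The evaluation loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the splitting scheme, so the sketch assumes repeated stratified random train/test splits with a 30% test fraction; the data are synthetic rather than the leukemia or colon cancer sets; and a nearest-centroid rule stands in for LDA, SVM, or RF so that the example needs only the standard library. All function names (`average_performance`, `fit_nearest_centroid`, `make_toy_data`, and so on) are hypothetical.

```python
import random
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """AUC as the chance that a positive outscores a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_split(y, test_frac, rng):
    """Random split that keeps both classes in the training and test parts."""
    train, test = [], []
    for label in sorted(set(y)):
        members = [i for i, t in enumerate(y) if t == label]
        rng.shuffle(members)
        cut = max(1, int(round(len(members) * test_frac)))
        test += members[:cut]
        train += members[cut:]
    return train, test


def fit_nearest_centroid(X_train, y_train):
    """Stand-in classifier (assumption): higher score means closer to class 1."""
    def centroid(label):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        return [mean(col) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)

    def score(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return d0 - d1
    return score


def average_performance(X, y, fit, n_iter=100, test_frac=0.3, seed=0):
    """Average AC, F-measure, and AUC over n_iter random resampling splits."""
    rng = random.Random(seed)
    runs = {"AC": [], "F": [], "AUC": []}
    for _ in range(n_iter):
        train, test = stratified_split(y, test_frac, rng)
        score = fit([X[i] for i in train], [y[i] for i in train])
        y_test = [y[i] for i in test]
        scores = [score(X[i]) for i in test]
        preds = [1 if s > 0 else 0 for s in scores]
        runs["AC"].append(accuracy(y_test, preds))
        runs["F"].append(f_measure(y_test, preds))
        runs["AUC"].append(auc(y_test, scores))
    return {name: mean(values) for name, values in runs.items()}


def make_toy_data(n_per_class=20, n_genes=50, shift=2.0, seed=1):
    """Synthetic expression matrix: the first 10 'genes' differ between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(shift * label if g < 10 else 0.0, 1.0)
                      for g in range(n_genes)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(average_performance(X, y, fit_nearest_centroid, n_iter=100))
```

To compare several models in the spirit of the study, one would call `average_performance` once per candidate `fit` function on the same data and seed, then rank the models by their averaged metrics rather than by any single split.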

International Journal of Statistical Sciences, Vol. 25(2), November, 2025, pp 47-57

Published

2025-12-17

How to Cite

Sarker, B., & Rahaman, M. M. (2025). Model Selection Strategies for Cancer Prediction from Gene Expression Data: A Beginner’s Perspective on Machine Learning. International Journal of Statistical Sciences, 25(2), 47–57. https://doi.org/10.3329/ijss.v25i2.85734

Section

Original Articles