Robust Variable Selection in High-Dimensional Data: Mitigating Cellwise Contamination Through Comparative Analysis
DOI:
https://doi.org/10.3329/dujs.v73i2.82773
Keywords:
Cellwise contamination, Robust variable selection, Gaussian rank correlation, High-dimensional regression, Independent contamination model, Sparse robust regression
Abstract
The proliferation of high-dimensional data has heightened the challenges posed by cellwise outliers, where contamination in individual cells distorts analyses more pervasively than traditional rowwise outliers. This study compares robust variable selection methods under cellwise contamination, evaluating four rank-based techniques (ALGR, ALRP, LGR, LRP) against traditional approaches (Lasso, Adaptive Lasso, sLTS). Simulations under varying correlation structures, contamination rates (2%, 5%, 10%), and outlier magnitudes (γ = 2, 6, 10) demonstrate that the Gaussian rank correlation-based methods (ALGR, LGR) achieve superior F1 scores, balancing high true positives against low false positives. Real-data applications to life expectancy and crime datasets corroborate these findings, with ALGR and LGR remaining robust in both low- and high-dimensional settings. The results underline the need for methods resilient to cellwise contamination in fields that depend on accurate high-dimensional data analysis, such as healthcare and genomics.
Dhaka Univ. J. Sci. 73(2): 143-150, 2025 (July)
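The three ingredients named in the abstract (independent cellwise contamination, Gaussian rank correlation, and the F1 score for variable selection) can be illustrated with a minimal sketch. This is not the authors' simulation code. All function names are hypothetical, and the way γ enters (each contaminated cell is simply replaced by the value γ) is an assumption, since the abstract does not specify the mechanism. Gaussian rank correlation is taken here in its usual form: the Pearson correlation of the normal scores Φ⁻¹(rank / (n + 1)).

```python
import random
from statistics import NormalDist


def normal_scores(x):
    """Map values to Gaussian scores Phi^{-1}(rank / (n + 1)); ties get average ranks."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(r / (n + 1)) for r in ranks]


def pearson(a, b):
    """Plain (non-robust) Pearson correlation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


def gaussian_rank_corr(x, y):
    """Gaussian rank correlation: Pearson correlation of the normal scores."""
    return pearson(normal_scores(x), normal_scores(y))


def contaminate_cells(x, rate, gamma, rng):
    """ASSUMED mechanism: each cell is independently replaced by gamma with probability `rate`."""
    return [gamma if rng.random() < rate else v for v in x]


def f1_selection(selected, truth):
    """F1 score of a selected variable set against the true active set."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy comparison: one predictor, 5% contaminated cells, gamma = 10.
rng = random.Random(0)
x_clean = [rng.gauss(0, 1) for _ in range(500)]
y = [v + 0.5 * rng.gauss(0, 1) for v in x_clean]
x_dirty = contaminate_cells(x_clean, rate=0.05, gamma=10.0, rng=rng)

naive = pearson(x_dirty, y)
robust = gaussian_rank_corr(x_dirty, y)
print(f"Pearson on contaminated data:       {naive:.3f}")
print(f"Gaussian rank on contaminated data: {robust:.3f}")
```

In this toy setting the rank-based estimate should degrade far less than the Pearson correlation, because a contaminated cell can only shift a rank rather than inflate a variance. A selection procedure built on such a correlation matrix would then be scored with `f1_selection`, which rewards recovering true predictors (true positives) while penalizing spurious ones (false positives).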