Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Document Type: Original Article

Authors

Computer Engineering and IT Department, Shahrood University of Technology, Shahrood, Iran

Abstract

Feature selection can be decisive when analyzing high-dimensional data, especially when only a small number of samples is available. Feature extraction methods do not perform well under these conditions. With small sample sets and high-dimensional data, exploring a large search space and learning from insufficient samples become extremely hard. As a result, neural networks and clustering algorithms perform poorly on this kind of data. In this paper, a novel hybrid feature selection technique is proposed that can drastically reduce the number of features with an acceptable loss of prediction accuracy. The proposed approach operates in multiple stages. It first removes irrelevant features with low discrimination power and then eliminates those with a low variation range. Afterward, from each set of features with high cross-correlation, only a single feature that is strongly correlated with the output is kept. Finally, a Genetic Algorithm with a customized cost function selects a small subset of the remaining features. To show the effectiveness of the proposed approach, we investigated two challenging case studies with sample set sizes of about 100 and more than 1000 features. The experimental results are promising: the number of features decreased by more than 99%, with a prediction accuracy of more than 92%.
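
The three filter stages that precede the Genetic Algorithm can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state how discrimination power is measured, so the sketch assumes the absolute Pearson correlation with the output. The function names and all three thresholds (min_relevance, min_range, max_cross_corr) are hypothetical, and the greedy rule used for the cross-correlation stage is only one possible reading of "a single feature per correlated set".

```python
import numpy as np


def abs_corr_with_target(X, y):
    """Absolute Pearson correlation of each column of X with y (0 for constant columns)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    safe = np.where(denom > 0, denom, 1.0)  # avoid division by zero
    return np.where(denom > 0, np.abs(Xc.T @ yc) / safe, 0.0)


def filter_stages(X, y, min_relevance=0.2, min_range=1e-3, max_cross_corr=0.9):
    """Return sorted indices of the features that survive three filter stages.

    X: (n_samples, n_features) array, y: (n_samples,) numeric output.
    The relevance measure and every threshold are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    relevance = abs_corr_with_target(X, y)
    keep = np.arange(X.shape[1])

    # Stage 1: drop features with low discrimination power (assumed: |r| with y).
    keep = keep[relevance[keep] >= min_relevance]

    # Stage 2: drop features whose values span a very small range.
    value_range = X[:, keep].max(axis=0) - X[:, keep].min(axis=0)
    keep = keep[value_range >= min_range]
    if keep.size == 0:
        return keep

    # Stage 3: visit features from most to least relevant and keep one only if
    # it is not highly cross-correlated with any feature already kept.
    order = keep[np.argsort(-relevance[keep])]
    cross = np.abs(np.atleast_2d(np.corrcoef(X[:, order], rowvar=False)))
    selected = []
    for i in range(order.size):
        if all(cross[i, j] < max_cross_corr for j in selected):
            selected.append(i)
    return np.sort(order[selected])
```

The returned indices would form the candidate pool for the final Genetic Algorithm stage. That stage is left out here because the abstract does not specify its customized cost function.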
