Outlier Detection in Test Samples using Standard Deviation and Unsupervised Training Set Selection

Document Type: Original Article

Authors

1 Department of Computer Engineering, Babol Branch, Islamic Azad University, Babol, Iran

2 Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran

Abstract

Outlier detection is a technique to identify and remove data that differ significantly from the more correct and consistent data in a data set. Outlier data can negatively affect classification and clustering performance, so they should be identified and removed to improve classification efficiency. Regardless of whether a classification technique classifies an outlier correctly, the very act of identifying a sample as an outlier is of great significance. In this paper, a new approach is proposed for detecting outlier data within a test set along with unsupervised training set selection. The selected training set is used for two-step classification. After unsupervised clustering of the training set, the cluster closest to a test sample is selected using the Euclidean distance measure. Then, outliers in the test samples are identified using the concepts of standard deviation and mean value. The results, obtained by evaluating the distance of each test sample from the newly selected data set, show that the accuracy of the classifiers is enhanced after detection and elimination of outlier data.
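The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes k-means as the unsupervised clustering step, and it flags a test sample as an outlier when its Euclidean distance to the nearest cluster centroid exceeds the mean plus k standard deviations of that cluster's member-to-centroid distances. The function name `detect_outliers` and the parameters `n_clusters` and `k` are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def detect_outliers(X_train, X_test, n_clusters=3, k=2.0):
    """Flag test samples as outliers via the closest training cluster.

    Sketch of the described idea: cluster the training set (k-means is
    assumed here), find the nearest cluster for each test sample by
    Euclidean distance, and mark the sample as an outlier when its
    distance to that cluster's centroid exceeds mean + k * std of the
    distances from the cluster's own members to the centroid.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_train)
    centroids = km.cluster_centers_
    labels = km.labels_
    flags = []
    for x in X_test:
        d = np.linalg.norm(centroids - x, axis=1)  # distance to each centroid
        c = int(np.argmin(d))                      # closest cluster
        member_d = np.linalg.norm(X_train[labels == c] - centroids[c], axis=1)
        threshold = member_d.mean() + k * member_d.std()
        flags.append(d[c] > threshold)
    return np.array(flags)
```

A test sample close to some cluster's centroid (relative to that cluster's internal spread) is kept; one far from every cluster is flagged, which mirrors the abstract's use of the mean and standard deviation as the decision rule.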



  1. Verbiest, N., Derrac, J., Cornelis, C., García, S. and Herrera, F., "Evolutionary wrapper approaches for training set selection as preprocessing mechanism for support vector machines: Experimental evaluation and support vector analysis", Applied Soft Computing, Vol. 38, (2016), 10-22. https://doi.org/10.1016/j.asoc.2015.09.006
  2. Liu, C., Wang, W., Wang, M., Lv, F. and Konan, M., "An efficient instance selection algorithm to reconstruct training set for support vector machine", Knowledge-Based Systems, Vol. 116, (2017), 58-73. https://doi.org/10.1016/j.knosys.2016.10.031
  3. Zechner, M. and Granitzer, M., "A competitive learning approach to instance selection for support vector machines", in International Conference on Knowledge Science, Engineering and Management, Springer, (2009), 146-157.
  4. Mohammed, A.M., Onieva, E. and Woźniak, M., "Training set selection and swarm intelligence for enhanced integration in multiple classifier systems", Applied Soft Computing, Vol. 95, (2020), 106568. https://doi.org/10.1016/j.asoc.2020.106568
  5. Mohseni, N., Nematzadeh, H. and Akbari, E., "Outlier detection in test samples and supervised training set selection", International Journal of Nonlinear Analysis and Applications, Vol. 12, No. 1, (2021), 701-712. https://dx.doi.org/10.22075/ijnaa.2021.4878
  6. Ren, Z., Wu, B., Zhang, X. and Sun, Q., "Image set classification using candidate sets selection and improved reverse training", Neurocomputing, Vol. 341, (2019), 60-69. https://doi.org/10.1016/j.neucom.2019.03.010
  7. Santiago-Ramirez, E., Gonzalez-Fraga, J.A., Gutierrez, E. and Alvarez-Xochihua, O., "Optimization-based methodology for training set selection to synthesize composite correlation filters for face recognition", Signal Processing: Image Communication, Vol. 43, (2016), 54-67. https://doi.org/10.1016/j.image.2016.02.002
  8. Smiti, A., "A critical overview of outlier detection methods", Computer Science Review, Vol. 38, (2020), 100306. https://doi.org/10.1016/j.cosrev.2020.100306
  9. Rath, S., Tripathy, A. and Tripathy, A.R., "Prediction of new active cases of coronavirus disease (covid-19) pandemic using multiple linear regression model", Diabetes & Metabolic Syndrome: Clinical Research & Reviews, Vol. 14, No. 5, (2020), 1467-1474. https://doi.org/10.1016/j.dsx.2020.07.045
  10. Chen, T., Martin, E. and Montague, G., "Robust probabilistic pca with missing data and contribution analysis for outlier detection", Computational Statistics & Data Analysis, Vol. 53, No. 10, (2009), 3706-3716. https://doi.org/10.1016/j.csda.2009.03.014
  11. Yang, Y., Fan, C., Chen, L. and Xiong, H., "Ipmod: An efficient outlier detection model for high-dimensional medical data streams", Expert Systems with Applications, Vol. 191, (2022), 116212. https://doi.org/10.1016/j.eswa.2021.116212
  12. Christy, A., Gandhi, G.M. and Vaithyasubramanian, S., "Cluster based outlier detection algorithm for healthcare data", Procedia Computer Science, Vol. 50, (2015), 209-215. https://doi.org/10.1016/j.procs.2015.04.058
  13. Lejeune, C., Mothe, J., Soubki, A. and Teste, O., "Shape-based outlier detection in multivariate functional data", Knowledge-Based Systems, Vol. 198, (2020), 105960. https://doi.org/10.1016/j.knosys.2020.105960
  14. Tang, B. and He, H., "A local density-based approach for outlier detection", Neurocomputing, Vol. 241, (2017), 171-180. https://doi.org/10.1016/j.neucom.2017.02.039
  15. Wang, B. and Mao, Z., "A dynamic ensemble outlier detection model based on an adaptive k-nearest neighbor rule", Information Fusion, Vol. 63, (2020), 30-40. https://doi.org/10.1016/j.inffus.2020.05.001
  16. Karlapalem, K., Cheng, H., Ramakrishnan, N., Agrawal, R., Reddy, P.K., Srivastava, J. and Chakraborty, T., "Advances in knowledge discovery and data mining: 25th pacific-asia conference, pakdd 2021, virtual event, may 11–14, 2021, proceedings, part i", Springer Nature, Vol. 12712, (2021).
  17. Yang, J., Rahardja, S. and Fränti, P., "Mean-shift outlier detection and filtering", Pattern Recognition, Vol. 115, (2021), 107874. https://doi.org/10.1016/j.patcog.2021.107874
  18. Wahid, A. and Annavarapu, C.S.R., "Nanod: A natural neighbour-based outlier detection algorithm", Neural Computing and Applications, Vol. 33, No. 6, (2021), 2107-2123. https://doi.org/10.1007/s00521-020-05068-2
  19. Acampora, G., Herrera, F., Tortora, G. and Vitiello, A., "A multi-objective evolutionary approach to training set selection for support vector machine", Knowledge-Based Systems, Vol. 147, (2018), 94-108. https://doi.org/10.1016/j.knosys.2018.02.022
  20. Esfandian, N. and Hosseinpour, K., "A clustering-based approach for features extraction in spectro-temporal domain using artificial neural network", International Journal of Engineering, Transactions B: Applications, Vol. 34, No. 2, (2021), 452-457. doi: 10.5829/ije.2021.34.02b.17.
  21. Beulah, D. and Vamsi Krishna Raj, P., "The ensemble of unsupervised incremental learning algorithm for time series data", International Journal of Engineering, Transactions B: Applications, Vol. 35, No. 2, (2022), 319-326. doi: 10.5829/ije.2022.35.02b.07.
  22. Biglari, M., Mirzaei, F. and Hassanpour, H., "Feature selection for small sample sets with high dimensional data using heuristic hybrid approach", International Journal of Engineering, Transactions B: Applications, Vol. 33, No. 2, (2020), 213-220. doi: 10.5829/ije.2020.33.02b.05.
  23. Fränti, P. and Sieranoja, S., "How much can k-means be improved by using better initialization and repeats?", Pattern Recognition, Vol. 93, (2019), 95-112. https://doi.org/10.1016/j.patcog.2019.04.014
  24. Luchi, D., Rodrigues, A.L. and Varejão, F.M., "Sampling approaches for applying dbscan to large datasets", Pattern Recognition Letters, Vol. 117, (2019), 90-96. https://doi.org/10.1016/j.patrec.2018.12.010
  25. Akbari, E., Dahlan, H.M., Ibrahim, R. and Alizadeh, H., "Hierarchical cluster ensemble selection", Engineering Applications of Artificial Intelligence, Vol. 39, (2015), 146-156. https://doi.org/10.1016/j.engappai.2014.12.005
  26. Singh, D., Gosain, A. and Saha, A., "Weighted k‐nearest neighbor based data complexity metrics for imbalanced datasets", Statistical Analysis and Data Mining: The ASA Data Science Journal, Vol. 13, No. 4, (2020), 394-404. https://doi.org/10.1002/sam.11463
  27. Chen, J., Zhang, C., Xue, X. and Liu, C.-L., "Fast instance selection for speeding up support vector machines", Knowledge-Based Systems, Vol. 45, (2013), 1-7. https://doi.org/10.1016/j.knosys.2013.01.031
  28. Nematzadeh, Z., Ibrahim, R. and Selamat, A., "Improving class noise detection and classification performance: A new two-filter cndc model", Applied Soft Computing, Vol. 94, (2020), 106428. https://doi.org/10.1016/j.asoc.2020.106428
  29. Speiser, J.L., Miller, M.E., Tooze, J. and Ip, E., "A comparison of random forest variable selection methods for classification prediction modeling", Expert Systems with Applications, Vol. 134, (2019), 93-101. https://doi.org/10.1016/j.eswa.2019.05.028
  30. Zhou, Q., Zhou, H. and Li, T., "Cost-sensitive feature selection using random forest: Selecting low-cost subsets of informative features", Knowledge-Based Systems, Vol. 95, (2016), 1-11. https://doi.org/10.1016/j.knosys.2015.11.010
  31. Nematzadeh, Z., Ibrahim, R., Selamat, A. and Nazerian, V., "The synergistic combination of fuzzy c-means and ensemble filtering for class noise detection", Engineering Computations, (2020). https://doi.org/10.1108/EC-05-2019-0242
  32. Lee, D.K., In, J. and Lee, S., "Standard deviation and standard error of the mean", Korean Journal of Anesthesiology, Vol. 68, No. 3, (2015), 220-223. https://doi.org/10.4097/kjae.2015.68.3.220