Outlier Detection in Test Samples using Standard Deviation and Unsupervised Training Set Selection

Document Type: Original Article

Authors

1 Department of Computer Engineering, Babol Branch, Islamic Azad University, Babol, Iran

2 Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran

Abstract

Outlier detection is a technique to identify and remove data that differ significantly from the more correct and consistent data in a data set. Outlier data can negatively affect classification and clustering performance, so they should be identified and removed to improve classification efficiency. Regardless of whether a classification technique classifies an outlier correctly, the very act of identifying a sample as an outlier is of great significance. In this paper, a new approach is proposed for detecting outlier data within a test set along with unsupervised training set selection. The selected training set is used for two-step classification. After unsupervised clustering of the training set, the cluster closest to a test sample is selected using the Euclidean distance measure. Then, outliers in the test samples are identified using the concepts of standard deviation and mean value. The results, obtained by evaluating the distance of each test sample from the newly selected data set, show that the accuracy of the classifiers is enhanced after detection and elimination of outlier data.
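The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes k-means as the unsupervised clustering step, and it flags a test sample as an outlier when its Euclidean distance to the nearest cluster centroid exceeds the mean plus k standard deviations of that cluster's member-to-centroid distances. The function name `detect_outliers` and the parameters `n_clusters` and `k` are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def detect_outliers(X_train, X_test, n_clusters=3, k=2.0):
    """Flag test samples as outliers via the closest training cluster.

    Sketch of the described idea: cluster the training set (k-means is
    assumed here), find the nearest cluster for each test sample by
    Euclidean distance, and mark the sample as an outlier when its
    distance to that cluster's centroid exceeds mean + k * std of the
    distances from the cluster's own members to the centroid.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_train)
    centroids = km.cluster_centers_
    labels = km.labels_
    flags = []
    for x in X_test:
        d = np.linalg.norm(centroids - x, axis=1)  # distance to each centroid
        c = int(np.argmin(d))                      # closest cluster
        member_d = np.linalg.norm(X_train[labels == c] - centroids[c], axis=1)
        threshold = member_d.mean() + k * member_d.std()
        flags.append(d[c] > threshold)
    return np.array(flags)
```

A test sample close to some cluster's centroid (relative to that cluster's internal spread) is kept; one far from every cluster is flagged, which mirrors the abstract's use of the mean and standard deviation as the decision rule.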



  1. Verbiest, N., Derrac, J., Cornelis, C., García, S. and Herrera, F., "Evolutionary wrapper approaches for training set selection as preprocessing mechanism for support vector machines: Experimental evaluation and support vector analysis", Applied Soft Computing, Vol. 38, (2016), 10-22. https://doi.org/10.1016/j.asoc.2015.09.006
  2. Liu, C., Wang, W., Wang, M., Lv, F. and Konan, M., "An efficient instance selection algorithm to reconstruct training set for support vector machine", Knowledge-Based Systems, Vol. 116, (2017), 58-73. https://doi.org/10.1016/j.knosys.2016.10.031
  3. Zechner, M. and Granitzer, M., "A competitive learning approach to instance selection for support vector machines", in International Conference on Knowledge Science, Engineering and Management, Springer, (2009), 146-157.
  4. Mohammed, A.M., Onieva, E. and Woźniak, M., "Training set selection and swarm intelligence for enhanced integration in multiple classifier systems", Applied Soft Computing, Vol. 95, (2020), 106568. https://doi.org/10.1016/j.asoc.2020.106568
  5. Mohseni, N., Nematzadeh, H. and Akbari, E., "Outlier detection in test samples and supervised training set selection", International Journal of Nonlinear Analysis and Applications, Vol. 12, No. 1, (2021), 701-712. https://dx.doi.org/10.22075/ijnaa.2021.4878
  6. Ren, Z., Wu, B., Zhang, X. and Sun, Q., "Image set classification using candidate sets selection and improved reverse training", Neurocomputing, Vol. 341, (2019), 60-69. https://doi.org/10.1016/j.neucom.2019.03.010
  7. Santiago-Ramirez, E., Gonzalez-Fraga, J.A., Gutierrez, E. and Alvarez-Xochihua, O., "Optimization-based methodology for training set selection to synthesize composite correlation filters for face recognition", Signal Processing: Image Communication, Vol. 43, (2016), 54-67. https://doi.org/10.1016/j.image.2016.02.002
  8. Smiti, A., "A critical overview of outlier detection methods", Computer Science Review, Vol. 38, (2020), 100306. https://doi.org/10.1016/j.cosrev.2020.100306
  9. Rath, S., Tripathy, A. and Tripathy, A.R., "Prediction of new active cases of coronavirus disease (covid-19) pandemic using multiple linear regression model", Diabetes & Metabolic Syndrome: Clinical Research & Reviews, Vol. 14, No. 5, (2020), 1467-1474. https://doi.org/10.1016/j.dsx.2020.07.045
  10. Chen, T., Martin, E. and Montague, G., "Robust probabilistic pca with missing data and contribution analysis for outlier detection", Computational Statistics & Data Analysis, Vol. 53, No. 10, (2009), 3706-3716. https://doi.org/10.1016/j.csda.2009.03.014
  11. Yang, Y., Fan, C., Chen, L. and Xiong, H., "Ipmod: An efficient outlier detection model for high-dimensional medical data streams", Expert Systems with Applications, Vol. 191, (2022), 116212. https://doi.org/10.1016/j.eswa.2021.116212
  12. Christy, A., Gandhi, G.M. and Vaithyasubramanian, S., "Cluster based outlier detection algorithm for healthcare data", Procedia Computer Science, Vol. 50, (2015), 209-215. https://doi.org/10.1016/j.procs.2015.04.058
  13. Lejeune, C., Mothe, J., Soubki, A. and Teste, O., "Shape-based outlier detection in multivariate functional data", Knowledge-Based Systems, Vol. 198, (2020), 105960. https://doi.org/10.1016/j.knosys.2020.105960
  14. Tang, B. and He, H., "A local density-based approach for outlier detection", Neurocomputing, Vol. 241, (2017), 171-180. https://doi.org/10.1016/j.neucom.2017.02.039
  15. Wang, B. and Mao, Z., "A dynamic ensemble outlier detection model based on an adaptive k-nearest neighbor rule", Information Fusion, Vol. 63, (2020), 30-40. https://doi.org/10.1016/j.inffus.2020.05.001
  16. Karlapalem, K., Cheng, H., Ramakrishnan, N., Agrawal, R., Reddy, P.K., Srivastava, J. and Chakraborty, T., "Advances in knowledge discovery and data mining: 25th pacific-asia conference, pakdd 2021, virtual event, may 11–14, 2021, proceedings, part i", Springer Nature, Vol. 12712, (2021).
  17. Yang, J., Rahardja, S. and Fränti, P., "Mean-shift outlier detection and filtering", Pattern Recognition, Vol. 115, (2021), 107874. https://doi.org/10.1016/j.patcog.2021.107874
  18. Wahid, A. and Annavarapu, C.S.R., "Nanod: A natural neighbour-based outlier detection algorithm", Neural Computing and Applications, Vol. 33, No. 6, (2021), 2107-2123. https://doi.org/10.1007/s00521-020-05068-2
  19. Acampora, G., Herrera, F., Tortora, G. and Vitiello, A., "A multi-objective evolutionary approach to training set selection for support vector machine", Knowledge-Based Systems, Vol. 147, (2018), 94-108. https://doi.org/10.1016/j.knosys.2018.02.022
  20. Esfandian, N. and Hosseinpour, K., "A clustering-based approach for features extraction in spectro-temporal domain using artificial neural network", International Journal of Engineering, Transactions B: Applications, Vol. 34, No. 2, (2021), 452-457. doi: 10.5829/ije.2021.34.02b.17.
  21. Beulah, D. and Vamsi Krishna Raj, P., "The ensemble of unsupervised incremental learning algorithm for time series data", International Journal of Engineering, Transactions B: Applications, Vol. 35, No. 2, (2022), 319-326. doi: 10.5829/ije.2022.35.02b.07.
  22. Biglari, M., Mirzaei, F. and Hassanpour, H., "Feature selection for small sample sets with high dimensional data using heuristic hybrid approach", International Journal of Engineering, Transactions B: Applications, Vol. 33, No. 2, (2020), 213-220. doi: 10.5829/ije.2020.33.02b.05.
  23. Fränti, P. and Sieranoja, S., "How much can k-means be improved by using better initialization and repeats?", Pattern Recognition, Vol. 93, (2019), 95-112. https://doi.org/10.1016/j.patcog.2019.04.014
  24. Luchi, D., Rodrigues, A.L. and Varejão, F.M., "Sampling approaches for applying dbscan to large datasets", Pattern Recognition Letters, Vol. 117, (2019), 90-96. https://doi.org/10.1016/j.patrec.2018.12.010
  25. Akbari, E., Dahlan, H.M., Ibrahim, R. and Alizadeh, H., "Hierarchical cluster ensemble selection", Engineering Applications of Artificial Intelligence, Vol. 39, (2015), 146-156. https://doi.org/10.1016/j.engappai.2014.12.005
  26. Singh, D., Gosain, A. and Saha, A., "Weighted k‐nearest neighbor based data complexity metrics for imbalanced datasets", Statistical Analysis and Data Mining: The ASA Data Science Journal, Vol. 13, No. 4, (2020), 394-404. https://doi.org/10.1002/sam.11463
  27. Chen, J., Zhang, C., Xue, X. and Liu, C.-L., "Fast instance selection for speeding up support vector machines", Knowledge-Based Systems, Vol. 45, (2013), 1-7. https://doi.org/10.1016/j.knosys.2013.01.031
  28. Nematzadeh, Z., Ibrahim, R. and Selamat, A., "Improving class noise detection and classification performance: A new two-filter cndc model", Applied Soft Computing, Vol. 94, (2020), 106428. https://doi.org/10.1016/j.asoc.2020.106428
  29. Speiser, J.L., Miller, M.E., Tooze, J. and Ip, E., "A comparison of random forest variable selection methods for classification prediction modeling", Expert Systems with Applications, Vol. 134, (2019), 93-101. https://doi.org/10.1016/j.eswa.2019.05.028
  30. Zhou, Q., Zhou, H. and Li, T., "Cost-sensitive feature selection using random forest: Selecting low-cost subsets of informative features", Knowledge-Based Systems, Vol. 95, (2016), 1-11. https://doi.org/10.1016/j.knosys.2015.11.010
  31. Nematzadeh, Z., Ibrahim, R., Selamat, A. and Nazerian, V., "The synergistic combination of fuzzy c-means and ensemble filtering for class noise detection", Engineering Computations, (2020). https://doi.org/10.1108/EC-05-2019-0242
  32. Lee, D.K., In, J. and Lee, S., "Standard deviation and standard error of the mean", Korean Journal of Anesthesiology, Vol. 68, No. 3, (2015), 220-223. https://doi.org/10.4097/kjae.2015.68.3.220