Repeated Record Ordering for Constrained Size Clustering

Document Type : Original Article

Author

School of Engineering, Damghan University, Damghan, Iran

Abstract

One of the main techniques used in data mining is data clustering, which has many applications in computer science, biology and social sciences. Constrained clustering is a type of clustering in which side information provided by the user is incorporated into current clustering algorithms. One of the well researched constrained clustering algorithms is called microaggregation. In a microaggregation technique, the algorithm divides the dataset into groups containing at least  members, where  is a user-defined parameter. The main application of microaggregation is in Statistical Disclosure Control (SDC) for privacy preserving data publishing. A microaggregation algorithm is qualified based on the sum of within-group squared error, . Unfortunately, it has been proven that the optimal microaggregation problem is NP-Hard in general, but the special univariate case can be solved optimally in polynomial time. Many heuristics exist for the general case of the problem that are founded on the univariate case. These techniques order multivariate records in a sequence. This paper proposes a novel method for record ordering. Starting from a conventional clustering algorithm, the proposed method repeatedly puts multivariate records into a sequence and then clusters them again. The process is repeated until no improvement is achieved. Extensive experiments have been conducted in this research to confirm the effectiveness of the proposed method for different parameters and datasets.

Keywords


1. Erfani, S.H. and Mortazavi, R., “A Novel Graph-modification Technique for User Privacy-preserving on Social Networks”, Journal of Telecommunications and Information Technology, Vol. 3, (2019), 27-38. DOI: 10.26636/jtit.2019.134319
2. Mortazavi, R. and Erfani, S.H., “An effective method for utility preserving social network graph anonymization based on mathematical modeling”, International Journal of Engineering, Transactions A: Basics Vol. 31, No. 10, (2018), 1624-1632. DOI: 10.5829/ije.2018.31.10a.03
3. Bypour, H., Farhadi, M. and Mortazavi R., “An Efficient Secret Sharing-based Storage System for Cloud-based Internet of Things”, International Journal of Engineering, Transactions B: Applications, Vol. 32, No. 8, (2019), 1117-1125. DOI: 10.5829/ije.2019.32.08b.07
4. Gheisari, M., Wang, G., Bhuiyan, M.Z.A. and Zhang W., “Mapp: A modular arithmetic algorithm for privacy preserving in IoT”, In IEEE International Symposium on Parallel and Distributed Processing with Applications and IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC), (2017), 897-903. DOI: 10.1109/ISPA/IUCC.2017.00137
5. Mortazavi, R. and Jalili, S., “Preference-based anonymization of numerical datasets by multi-objective microaggregation”,
Information Fusion, Vol. 25, (2015), 85-104. https://doi.org/10.1016/j.inffus.2014.10.003
6. Sweeney, L., “k-anonymity: A model for protecting privacy”, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, No. 5, (2002), 557-570. https://doi.org/10.1142/S0218488502001648
7. Domingo-Ferrer, J. and Torra, V., “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, No. 2, (2005), 195-212. https://doi.org/10.1007/s10618-005-0007-5
8. Oganian, A. and Domingo-Ferrer, J., “On the complexity of optimal microaggregation for statistical disclosure control”, Statistical Journal of the United Nations Economic Commission for Europe, Vol. 18, No. 4, (2001), 345-353.
9. Hansen, S.L. and Mukherjee, S., “A polynomial algorithm for optimal univariate microaggregation”, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, (2003), 1043-1044. DOI: 10.1109/TKDE.2003.1209020
10. Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J.M. and Sebé, F., “Efficient multivariate data-oriented microaggregation”, The VLDB Journal, Vol. 15, No. 4, (2006), 355-69. https://doi.org/10.1007/s00778-006-0007-0
11. Mortazavi, R., Jalili, S. and Gohargazi, H., “Multivariate microaggregation by iterative optimization”, Applied Intelligence, Vol. 39, No. 3, (2013), 529-544. https://doi.org/10.1007/s10489-013-0431-y
12. Aloise, D., Hansen, P., Rocha, C. and Santi, É., “Column generation bounds for numerical microaggregation”, Journal of Global Optimization, Vol. 60, No. 2, (2014), 165-182. https://doi.org/10.1007/s10898-014-0149-3
13. Domingo-Ferrer, J. and Mateo-Sanz, J.M., “Practical data-oriented microaggregation for statistical disclosure control”, IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, (2002), 189-201.
14. Monedero, D.R., Mezher, A.M., Colomé, X.C., Forné, J. and Soriano, M., “Efficient k-anonymous microaggregation of multivariate numerical data via principal component analysis”, Information Sciences, Vol. 503, (2019), 417-443. https://doi.org/10.1016/j.ins.2019.07.042
15. Soria-Comas, J. and Domingo-Ferrer, J., “Differentially private data publishing via optimal univariate microaggregation and record perturbation”, Knowledge-Based Systems, Vol. 153, (2018), 78-90. https://doi.org/10.1016/j.knosys.2018.04.027
16. Mortazavi, R. and Jalili, S., “Fast data-oriented microaggregation algorithm for large numerical datasets”, Knowledge-Based Systems, Vol. 67, (2014), 195-205. https://doi.org/10.1016/j.knosys.2014.05.011
17. Khomnotai, L., Lin, J.L., Peng, Z.Q. and Santra, A.S., “Iterative Group Decomposition for Refining Microaggregation Solutions”, Symmetry, Vol. 10, No. 7, (2018), 262-274. https://doi.org/10.3390/sym10070262
18. Mortazavi, R. and Jalili, S., “Fine granular proximity breach prevention during numerical data anonymization”, Transactions on Data Privacy, Vol. 10, No. 2, (2017), 117-144.
19. Mortazavi, R. and Erfani, S.H., “GRAM: An efficient (k, l) graph anonymization method”, Expert Systems with Applications, Vol. 153, (2020), In press. https://doi.org/10.1016/j.eswa.2020.113454
20. Pelleg, D. and Moore, A.W., “X-means: Extending k-means with efficient estimation of the number of clusters”, In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), (2000), 727-734.
21. Ünlü, R. and Xanthopoulos, P., “Estimating the number of clusters in a dataset via consensus clustering”, Expert Systems with Applications, Vol. 125, (2019), 33-39. https://doi.org/10.1016/j.eswa.2019.01.074