The Construction of a Scalable Decision Tree Based on Fast Splitting and J-Max Pre-Pruning on Large Datasets

Document Type: Original Article


1 Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

2 Computer Department, Engineering Campus, Yazd University, Yazd, Iran


The decision tree is one of the most important classification algorithms, as it offers a comprehensible model of the data. When a tree is built on a large dataset, memory limitations may be encountered. The present study implements an incremental, scalable approach based on fast splitting and pre-pruning that constructs a decision tree on a large dataset while reducing the complexity of the tree. The proposed algorithm builds the decision tree with a minimum number of parameters and without storing the entire dataset in primary memory. Furthermore, the J-max pre-pruning method is used to reduce complexity with acceptable results. Experimental results show that this approach strikes a balance between the accuracy and the complexity of the tree and overcomes the difficulties caused by tree complexity. While maintaining acceptable accuracy and running time, the proposed algorithm produces a decision tree with lower complexity on large datasets.
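The core idea of building a tree without holding the whole dataset in primary memory is to stream the data in batches and accumulate only per-(attribute, value, class) counts, from which a split can then be chosen. The sketch below illustrates this in rough form; it is a minimal illustration under assumed details (categorical attributes, an information-gain criterion, and all function names), not the authors' implementation.

```python
from collections import defaultdict
from math import log2

def stream_split_counts(batches, n_attrs):
    """Accumulate per-(attribute, value, class) counts over data batches,
    so the full dataset never has to reside in memory at once."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attrs)]
    class_totals = defaultdict(int)
    total = 0
    for batch in batches:                    # each batch: [(row, label), ...]
        for row, label in batch:
            class_totals[label] += 1
            total += 1
            for a in range(n_attrs):
                counts[a][row[a]][label] += 1
    return counts, class_totals, total

def entropy(class_counts, n):
    """Shannon entropy of a class-count distribution."""
    return -sum((c / n) * log2(c / n) for c in class_counts.values() if c)

def best_split(counts, class_totals, total):
    """Pick the attribute with the highest information gain using only the
    streamed counts -- no second pass over the raw data is needed."""
    base = entropy(class_totals, total)
    best_attr, best_gain = None, -1.0
    for a, value_counts in enumerate(counts):
        remainder = 0.0
        for per_class in value_counts.values():
            n_v = sum(per_class.values())
            remainder += (n_v / total) * entropy(per_class, n_v)
        gain = base - remainder
        if gain > best_gain:
            best_attr, best_gain = a, gain
    return best_attr, best_gain
```

In this sketch, applying the same accumulation recursively to the rows routed to each child would grow the tree node by node, and a pre-pruning rule such as J-max would decide at each node whether expanding it is worthwhile.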


  1. Agarwal, S., "Data mining: Data mining concepts and techniques." In 2013 International Conference on Machine Intelligence and Research Advancement, (2013), 203-207, IEEE, doi: 10.1109/ICMIRA.2013.45.
  2. Qolipour, F., Ghasemzadeh, M. and Mohammad-Karimi, N., "The Predictability of Tree-based Machine Learning Algorithms in the Big Data Context." International Journal of Engineering, Transactions A: Basics, Vol. 34, No. 1, (2021), 82-89, doi: 10.5829/ije.2021.34.01a.10.
  3. Chen, Y.L., Wu, C.C. and Tang, K., "Time-constrained cost-sensitive decision tree induction." Information Sciences, Vol. 354, (2016), 140-152, doi: 10.1016/j.ins.2016.03.022.
  4. Priyanka and Kumar, D., "Decision tree classifier: a detailed survey." International Journal of Information and Decision Sciences, Vol. 12, No. 3, (2020), 246-269, doi: 10.38094/jastt20165.
  5. Franco-Arcega, A., Carrasco-Ochoa, J.A., Sánchez-Díaz, G. and Martínez-Trinidad, J.F., "Decision tree induction using a fast splitting attribute selection for large datasets." Expert Systems with Applications, Vol. 38, No. 11, (2011), 14290-14300, doi: 10.1016/j.eswa.2011.05.087.
  6. Stahl, F. and Bramer, M., "Jmax-pruning: A facility for the information theoretic pruning of modular classification rules." Knowledge-Based Systems, Vol. 29, (2012), 12-19, doi: 10.1016/j.knosys.2011.06.016.
  7. Grossi, V., Romei, A. and Turini, F., "Survey on using constraints in data mining." Data Mining and Knowledge Discovery, Vol. 31, No. 2, (2017), 424-464, doi: 10.1007/s10618-016-0480-z.
  8. Chandra, B., Kothari, R. and Paul, P., "A new node splitting measure for decision tree construction." Pattern Recognition, Vol. 43, No. 8, (2010), 2725-2731, doi: 10.1016/j.patcog.2010.02.025.
  9. Lomax, S. and Vadera, S., "A survey of cost-sensitive decision tree induction algorithms." ACM Computing Surveys (CSUR), Vol. 45, No. 2, (2013), 1-35, doi: 10.1145/2431211.2431215.
  10. Brunello, A., Marzano, E., Montanari, A. and Sciavicco, G., "Decision tree pruning via multi-objective evolutionary computation." International Journal of Machine Learning and Computing, Vol. 7, No. 6, (2017), 167-175, doi: 10.18178/IJMLC.2017.7.6.641.
  11. Bramer, M., "Using J-pruning to reduce overfitting in classification trees." In Research and Development in Intelligent Systems XVIII, Springer, London, (2002), 25-38, doi: 10.1007/978-1-4471-0119-2_3.
  12. Manapragada, C., Webb, G.I. and Salehi, M., "Extremely fast decision tree." In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, (2018), 1953-1962, doi: 10.1145/3219819.3220005.
  13. Gehrke, J., Ganti, V., Ramakrishnan, R. and Loh, W.Y., "BOAT-optimistic decision tree construction." ACM SIGMOD Record, Vol. 28, No. 2, (1999), doi: 10.1145/304182.304197.
  14. Mehta, M., Agrawal, R. and Rissanen, J., "SLIQ: A fast scalable classifier for data mining." In International Conference on Extending Database Technology, Springer, Berlin, Heidelberg, (1996), 18-32, doi: 10.1007/BFb0014141.
  15. Zaki, M.J., "Parallel and distributed data mining: An introduction." In Large-Scale Parallel Data Mining, Springer, Berlin, Heidelberg, (2000), 1-23, doi: 10.1007/3-540-46502-2_1.
  16. Gehrke, J., Ramakrishnan, R. and Ganti, V., "RainForest—a framework for fast decision tree construction of large datasets." Data Mining and Knowledge Discovery, Vol. 4, No. 2, (2000), 127-162, doi: 10.1023/A:1009839829793.
  17. Hulten, G. and Domingos, P., "Mining Decision Trees from Streams." In Data Stream Management, Springer, Berlin, Heidelberg, (2016), 189-208, doi: 10.1007/978-3-540-28608-0_9.
  18. Domingos, P. and Hulten, G., "Mining high-speed data streams." In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2000), 71-80, doi: 10.1145/347090.347107.
  19. Ranka, S. and Singh, V., "CLOUDS: A decision tree classifier for large datasets." In Proceedings of the 4th Knowledge Discovery and Data Mining Conference, Vol. 2, No. 8, (1998).
  20. Yang, B., Wang, T., Yang, D. and Chang, L., "BOAI: Fast alternating decision tree induction based on bottom-up evaluation." In Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Berlin, Heidelberg, (2008), 405-416, doi: 10.1007/978-3-540-68125-0_36.
  21. Blake, C.L. and Merz, C.J., "UCI Repository of machine learning databases." Irvine, CA: University of California, Department of Information and Computer Science, (2008), [http://www.ics.uci.edu/~mlearn/MLRepository.html].