A Two-Level Semi-supervised Clustering Technique for News Articles

Document Type: Original Article

Authors

Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran

Abstract

The web and social media carry an overwhelming volume and variety of news articles. Document clustering is widely used to organize and manage such data by dividing it into smaller groups. One factor that influences clustering quality is how documents are represented. Some traditional representation methods rely on word frequencies and produce large, sparse document vectors that cannot preserve proximity information between documents. Neural network-based methods preserve proximity information but suffer from poor interpretability. Concept-based text representation methods overcome these shortcomings, yet semi-supervised text clustering does not currently use concept-based document representation. This paper presents a two-level semi-supervised text clustering method that uses labeled and unlabeled data simultaneously to achieve higher clustering quality. In the first level, documents are represented by concepts extracted from the raw corpus. In the second level, the semi-supervised clustering process uses unlabeled data to capture the overall structure of the clusters and a small amount of labeled data to adjust the cluster centers. Experiments on the Reuters-21578 collection show that the proposed model outperforms other semi-supervised approaches in both text classification and text clustering.
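The two levels described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the word-to-concept map `WORD_TO_CONCEPT` is a hand-written, hypothetical stand-in for concepts extracted from a corpus, and the second level swaps in a simple constrained seeded k-means, in which labeled documents initialize the cluster centers and stay in their known clusters while unlabeled documents shape the centers during the updates. All function names and the toy documents are assumptions made for illustration.

```python
from collections import Counter
import math

# Hypothetical word-to-concept map (assumption); in practice concepts
# would be extracted from the raw corpus rather than written by hand.
WORD_TO_CONCEPT = {
    "stock": 0, "market": 0, "shares": 0, "bank": 0,
    "match": 1, "goal": 1, "team": 1, "league": 1,
}
N_CONCEPTS = 2


def concept_vector(text):
    """Level 1: count concept occurrences in a document and L2-normalize."""
    counts = Counter(
        WORD_TO_CONCEPT[w] for w in text.lower().split() if w in WORD_TO_CONCEPT
    )
    vec = [float(counts.get(c, 0)) for c in range(N_CONCEPTS)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(vectors, seeds, n_clusters, n_iter=10):
    """Level 2 stand-in: constrained seeded k-means.

    `seeds` maps a document index to its known cluster label.
    Assumes every cluster has at least one labeled document.
    """
    # Labeled documents fix the initial position of each center.
    centers = [
        mean([vectors[i] for i, lab in seeds.items() if lab == k])
        for k in range(n_clusters)
    ]
    assign = []
    for _ in range(n_iter):
        # Labeled documents keep their label; unlabeled ones go to the
        # nearest center.
        assign = [
            seeds[i] if i in seeds
            else min(range(n_clusters), key=lambda k: sq_dist(v, centers[k]))
            for i, v in enumerate(vectors)
        ]
        # Centers are recomputed from all members, labeled and unlabeled.
        for k in range(n_clusters):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:
                centers[k] = mean(members)
    return assign, centers


docs = [
    "Stock market shares fall as bank reports loss",   # labeled: cluster 0
    "Team scores late goal to win league match",       # labeled: cluster 1
    "Bank shares rally on market optimism",            # unlabeled
    "League match ends without a goal",                # unlabeled
]
vectors = [concept_vector(d) for d in docs]
labels, centers = seeded_kmeans(vectors, seeds={0: 0, 1: 1}, n_clusters=2)
print(labels)  # [0, 1, 0, 1]
```

Because each dimension of the document vector corresponds to a named concept, the resulting cluster centers remain readable, which is the interpretability advantage the abstract attributes to concept-based representations.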

Keywords

