A Multiple Kernel Learning based Model with Clustered Features for Cancer Stage Detection using Gene Datasets

Document Type: Original Article


Department of Electrical and Computer Engineering, Nooshirvani University of Technology, Babol, Iran


Genomic data is used in various fields of medicine, including the diagnosis, prediction, and treatment of diseases. Detecting the stage of cancer progression is crucial for treating patients because cancer mortality is higher when the disease is diagnosed at a late stage. Furthermore, the type of treatment varies with the cancer stage. This paper presents a Multiple Kernel Learning based algorithm that predicts the stage of cancer from genomic data. Because genomic data is high-dimensional, the curse of dimensionality may degrade stage prediction. To reduce the dimension, the proposed algorithm first clusters the features. The original data samples are then split into smaller subsets with reduced dimensions based on the computed feature clusters. A kernel matrix is calculated for each subset, and the kernel matrices are weighted and combined linearly. Finally, a cancer stage prediction model is trained using the combined kernel matrix and a Support Vector Machine. The proposed algorithm is compared with baseline methods and achieves higher classification accuracy than the other methods in 13 of the 15 cancer groups from The Cancer Genome Atlas (TCGA) program dataset.
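The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering method, the kernel type, or the weighting rule, so the plain k-means on feature columns, the RBF kernel, the kernel-target alignment weights, the binary early/late labels, and all function names below are assumptions chosen for illustration. The data is synthetic.

```python
import numpy as np


def cluster_features(X, n_clusters, n_iter=20, seed=0):
    """Group the columns (features) of X with a plain k-means.
    Assumption: the abstract does not say which clustering method is used."""
    rng = np.random.default_rng(seed)
    cols = np.asarray(X, dtype=float).T  # one row per feature
    centers = cols[rng.choice(len(cols), size=n_clusters, replace=False)]
    labels = np.zeros(len(cols), dtype=int)
    for _ in range(n_iter):
        # squared distance from every feature to every cluster center
        d = ((cols[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = cols[labels == k].mean(axis=0)
    return labels


def rbf_kernel(A, gamma=None):
    """RBF kernel matrix between the rows of A (assumed kernel type)."""
    if gamma is None:
        gamma = 1.0 / A.shape[1]
    sq = (A ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)
    return np.exp(-gamma * d2)


def alignment(K, y):
    """Kernel-target alignment of K with the ideal kernel y y^T, y in {-1, +1}.
    Assumption: used here only as one plausible way to weight the kernels."""
    Y = np.outer(y, y)
    return (K * Y).sum() / (np.linalg.norm(K) * np.linalg.norm(Y))


def combined_kernel(X, y, n_clusters=3):
    """Cluster features, build one kernel per feature subset, combine linearly."""
    labels = cluster_features(X, n_clusters)
    kernels = [rbf_kernel(X[:, labels == k])
               for k in range(n_clusters) if np.any(labels == k)]
    w = np.array([max(alignment(K, y), 0.0) for K in kernels])
    if w.sum() > 0:
        w = w / w.sum()
    else:
        w = np.full(len(kernels), 1.0 / len(kernels))
    K = sum(wi * Ki for wi, Ki in zip(w, kernels))
    return K, w


# Synthetic stand-in for a gene expression matrix: 40 samples, 30 features.
rng = np.random.default_rng(0)
y = np.where(rng.random(40) < 0.5, -1, 1)  # hypothetical early/late labels
X = rng.normal(size=(40, 30))
X[y == 1, :10] += 1.5                      # make the first 10 features informative

K, w = combined_kernel(X, y)
print(K.shape, w.round(3))
```

The combined matrix `K` plays the role of the combined kernel matrix in the abstract. For the final step it could be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.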


