INTRODUCTION
The increasing generation of genomic data and the need to store, retrieve, and properly analyze them led to the emergence of bioinformatics. Bioinformatics deals with the mathematical and computational aspects of understanding and processing biological data. In other words, the aim of bioinformatics is to increase understanding of biological processes through the use of computational techniques [1].
With the significant growth of biological data generation, these data play an important role in analyzing and resolving problems in medicine such as cancer diagnosis and treatment [2]. Before the advent of machine learning methods, bioinformatics algorithms were written manually, which made them difficult to use in applications such as protein structure prediction [3]. Today, machine learning tools and methods are widely used in bioinformatics applications [4].
The dimension of gene expression data is very high, which may degrade the performance of classification algorithms due to the curse of dimensionality. The curse of dimensionality is addressed by dimension reduction. Principal Component Analysis (PCA) is a widely used method for reducing the dimension of data.
To reduce the dimension of gene expression data, a novel method is employed in this paper. To this end, the features are clustered first; then the data is divided into groups such that, in each group, the data is represented by the corresponding feature cluster. It is worth noting that the number of clusters and the number of data groups are equal. Finally, in their new representation, the data groups are combined by a Multiple Kernel Learning classifier in order to determine the stage of cancer progression.
The key contributions of the proposed algorithm, Feature Clustering Multiple Kernel Learning (FCMKL), are as follows:
• The genomic data used for cancer stage detection, which is the main focus of this paper, is gene expression. The dimension of gene expression data is high. To avoid the curse of dimensionality, the features are clustered into smaller groups. Because each group has a reduced dimension, the classifier does not suffer from the curse of dimensionality. Also, this method does not change or remove features.
• For each data group, a kernel matrix is calculated. Then a weighted linear combination of the kernel matrices is computed in a Multiple Kernel Learning framework, which is used to detect the cancer stage of the patient.
• This paper combines clustering and classification algorithms to predict the cancer stage of patients.
A block diagram of the proposed method is depicted in Figure 1.
This paper is organized as follows. In the second section, related works are reviewed. The third section explains the proposed algorithm in detail. In section four, the experimental results of the proposed algorithm are presented and discussed. The last section concludes the paper.

RELATED WORKS
This section reviews some works related to machine learning based cancer diagnosis and treatment including cancer stage detection.
An integrated model based on logistic regression and support vector machine for the classification of Colorectal Cancer (CRC) into cancerous and normal samples was proposed by Zhao et al. [5].
The method proposed by Bhalla et al. [6] identifies genes to detect the progression of renal cell cancer. For this purpose, gene expression data from the KIRC cancer group of the TCGA dataset is used. This method is based on the fact that only a few genes are important for determining the stage of cancer. In this method, a threshold value is selected for each gene that determines whether a sample is in an early or a late stage according to the expression of that gene. Finally, the selected genes were fed to a support vector machine.

Figure 1. The block diagram of the proposed method
Huo et al. [7] used gene expression data for tumor classification based on the sparsity characteristics of genes. To this end, related genes are selected via the sparse group lasso method. Then, tumors are classified by a support vector machine. Ranjani Rani and Ramyachitra [8] proposed a similar framework for cancer classification. They employed the Spider Monkey Optimization algorithm to select related genes.
After detecting differentially expressed genes, the framework proposed by Xu et al. [9] used a Protein-Protein Interaction (PPI) network based neighborhood scoring technique in combination with a Support Vector Machine for colon cancer diagnosis and recurrence prediction.
Medjahed et al. [10] employed a Support Vector Machine in two phases to select the best gene set from DNA microarray data for the cancer diagnosis task. A two-stage feature selection method based on Multiple Kernel Learning was proposed by Du et al. [11] to predict cancer. In the first step of that method, relevant features are identified by Multiple Kernel Learning. In the second step, a subset of features is selected from the set of candidate features obtained in the first step. Data fusion based on Multiple Kernel Learning was proposed by Speicher and Pfeifer [12] to identify cancer subtypes. In order to reduce the dimension of gene data, their method was combined with a graph embedding framework.
A model based on the combination of clustering and a Multiple Kernel Learning framework was proposed by Speicher and Pfeifer [13] to identify cancer subtypes. In the proposed model, the features are clustered based on the combination of several kernels, then the effect of each feature cluster on a patient cluster is measured.
The method proposed by Tao et al. [14] deals with the classification of five subtypes of breast cancer based on Multiple Kernel Learning. The data used in that research are gene expression, DNA methylation, and copy number variation from the TCGA dataset. Some genes may have little or no effect on the classification of breast cancer subtypes, and these should be identified. For this purpose, the p-values of genes were calculated using the Wilcoxon rank sum test. The Benjamini-Hochberg false discovery rate procedure is then applied to adjust the computed p-values. Genes with a p-value less than 0.05 are selected as significant genes.
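As an illustration of the adjustment step described above, the following minimal Python sketch applies the standard Benjamini-Hochberg procedure to a list of p-values. It is not the code of Tao et al.; the function name and the example p-values are hypothetical, and it assumes that the 0.05 threshold is applied to the adjusted values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-gene p-values (e.g. from a rank sum test).
p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(p)
# Assumption: genes are kept when the adjusted p-value is below 0.05.
significant = [i for i, q in enumerate(adj) if q < 0.05]
```

In this toy example the third and fourth genes have raw p-values below 0.05 but are no longer selected after adjustment.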
Four types of genomic data in addition to pathological images were used by Sun et al. [15] to predict the survival of breast cancer patients. In the proposed method, Multiple Kernel Learning is employed to integrate the different data types.
To predict the survival of patients with squamous cell lung cancer who underwent surgery, a new method based on Multiple Kernel Learning was proposed by Zhang et al. [16]. Because the small number of samples aggravates the curse of dimensionality, a linear correlation algorithm is employed to select the optimal features.
Multiple Kernel Learning was used by Wilson et al. [17] to determine the best kernels calculated from two types of data, clinical data and microRNA, from the TCGA dataset. The goal is to predict whether a patient with ovarian cancer will live more than three years after diagnosis.
A method to determine the cancer stage using Multiple Kernel Learning was proposed by Rahimi and Gönen [18]. In that paper, instead of identifying clusters of gene expression features and computing kernel matrices separately, it was proposed to combine these two steps into a single model using prior knowledge about pathways and gene sets. For this purpose, they create a separate kernel matrix for each gene set, then combine them using a Multiple Kernel Learning algorithm.
A set of pathways/genes along with gene data was used by Rahimi and Gönen [19] to detect the cancer stage. Different types of cancers with distinct biological mechanisms have similarities. In that paper, each cancer group is considered a specific task. A multi-task learning formulation is used in which the different tasks are trained simultaneously. The goal is to identify similarities between cancer groups (i.e., tasks) in terms of their basic mechanisms. Joint clustering is used for this purpose.
Deep learning based methods often achieve high accuracy in data classification. Zohrevand et al. [20] employed a Convolutional Neural Network, a powerful deep learning approach, for Finger-Knuckle-Print recognition. A Fully Convolutional Network in combination with the graph's shortest layer path has been used for fluid segmentation in retina images [21]. Also, a fully automated model was trained by Azimi et al. [22] for fluid segmentation. In this two-path method, the first and last layers of the retina are segmented in the Neutrosophic domain. Then, a Fully Convolutional Network is used for fluid segmentation. Assigning appropriate values to parameters is very important in machine learning based methods. Chegeni et al. [23] proposed a mathematical model to compute the Convolutional Neural Network model parameters automatically. Deep learning based methods suffer from high computational complexity in the training phase and a large number of parameters, including weights. To address these problems, SqueezeNet, a compact version of the Convolutional Neural Network, was employed for document classification, with classification results comparable to those of a Convolutional Neural Network [24].
Salimy et al. [25] proposed a deep learning framework to predict the survival of colon cancer patients. This method integrates three types of data, namely gene expression, DNA methylation, and clinical data, by means of an autoencoder. Slimene et al. [26] used microRNA for cancer classification. After converting the microRNA data into images, ResNet, which is a pretrained Deep Neural Network, is employed to classify the data.

PROPOSED FCMKL
This paper focuses on distinguishing early and late cancer stages by using a gene expression dataset. Cancer stage detection is treated as a binary classification problem. In problems such as cancer stage detection, in which data samples are usually not linearly separable, the use of a kernel function, which implicitly maps the data to a high-dimensional space, improves the classification accuracy (Figure 2a).
The dimension of data samples in the gene expression dataset is very high, which degrades the performance of cancer stage classification due to the curse of dimensionality. To address this problem, it is necessary to reduce the data dimension (Figure 2b).
In order to reduce the data dimension, features are clustered in the first step of the proposed method. The idea is to compute a separate kernel for each cluster of features. After reducing the feature dimension, a Multiple Kernel Learning classifier is trained to classify cancer stages by using the computed kernels (Figure 2c). The ratio of the number of features to the number of samples is a determining factor in classification performance. It should be noted that for a fixed sample size, as the number of features grows, the classification error first decreases and then increases [27]. When the features are independent, it is enough that the number of features does not exceed N-1, where N is the number of samples. As the feature correlation increases, this number decreases, such that if the correlation is very high, it decreases to √N, which is used to set the number of clusters in the proposed method [27].

Figure 2. Kernel matrix for high dimension Gene Expression data
The architecture of FCMKL is illustrated in Figure 3. In the following sections, the training and testing phases are explained in detail.
As Figure 3 shows, the dataset is first divided into training and testing sets. In the training phase, the features of the training data are clustered. Then, a kernel matrix is computed for each feature cluster. A single kernel is obtained by a weighted linear combination of the computed kernels. Then, the kernel-based Support Vector Machine is trained. In the testing phase, after the kernel corresponding to the testing data is calculated based on the feature clusters detected in the training phase, the testing data are classified by the trained Support Vector Machine.

1. Training Phase
The main steps of the FCMKL algorithm are as follows.

Step 1: Feature clustering
As described before, to address the curse of dimensionality in the proposed algorithm, the original dataset is divided into smaller ones. To this end, the features are clustered by the k-means clustering algorithm. More precisely, the rows (samples) and columns (features) of the dataset are interchanged and given as input to the k-means algorithm. The k-means algorithm clusters the features based on the samples, and its output is a set of feature clusters.

Step 2: Split the dataset into smaller datasets based on feature clusters
By using the feature clusters obtained in the previous step, the dataset is divided into smaller subsets such that each subset uses one feature cluster. In each small subset, the rows are the samples of the original dataset and the columns are the features in the corresponding cluster. In this way, each data subset has the same N samples but different features.

Figure 3. FCMKL architecture

Step 3: Computing the kernel matrix for each data subset
Each data subset is implicitly mapped to a high-dimensional feature space by using a separate kernel function. Eventually, for each data subset, an N×N kernel matrix is computed based on the corresponding kernel function.

Step 4: Kernel weighting
Before the kernels calculated in the previous step are combined, they must be weighted. To this end, the classification performance (AUC) of each kernel with a Support Vector Machine is computed. Then, a weight is assigned to each kernel using Equation (1), in which $\eta_i$ is the weight of the i-th kernel, $P$ is the number of kernels, and $a_i$ is the result of predicting the cancer stage using the i-th kernel.

Step 5: Weighted linear combination of kernel matrices
In this step, the kernel matrices are linearly combined using the weights calculated in the previous step, according to Equation (2), and a single kernel matrix is created:

$$K = \sum_{i=1}^{P} \eta_i K_i \qquad (2)$$

where $K_i$ is the i-th kernel matrix, $\eta_i$ is its weight, and $P$ is the number of kernels.

Step 6: Training kernel-based support vector machine
Finally, by using the combined kernel matrix, a kernel-based Support Vector Machine is trained.
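The training steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R implementation: the k-means routine is a plain Lloyd's iteration, the kernel width `sigma` is a fixed placeholder, and `kernel_weights` assumes weights proportional to each kernel's AUC, which may differ from the paper's weighting rule. All function names, the toy data, and the AUC values are hypothetical, and SVM training (Step 6) is omitted.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Recompute centers; keep the old center if a cluster is empty.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_features(X, k):
    """Step 1: transpose X (samples x features) so each feature is a point."""
    features_as_points = [list(col) for col in zip(*X)]
    return kmeans(features_as_points, k)


def split_by_cluster(X, labels, k):
    """Step 2: one sub-dataset per feature cluster (same rows, fewer columns)."""
    subsets = []
    for c in range(k):
        cols = [j for j, l in enumerate(labels) if l == c]
        subsets.append([[row[j] for j in cols] for row in X])
    return subsets


def gaussian_kernel_matrix(A, sigma=1.0):
    """Step 3: N x N Gaussian kernel on one sub-dataset (sigma is a placeholder)."""
    return [[math.exp(-math.dist(a, b) ** 2 / (2 * sigma ** 2)) for b in A]
            for a in A]


def kernel_weights(aucs):
    """Step 4 (assumed form): weights proportional to each kernel's AUC."""
    total = sum(aucs)
    return [a / total for a in aucs]


def combine(kernels, weights):
    """Step 5: weighted linear combination of the kernel matrices."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]


# Toy data: 3 samples x 4 features, with two obvious feature groups.
X = [[0.0, 0.1, 5.0, 5.2],
     [1.0, 1.1, 4.0, 4.1],
     [2.0, 2.1, 3.0, 3.3]]
labels = cluster_features(X, k=2)
subsets = split_by_cluster(X, labels, k=2)
kernels = [gaussian_kernel_matrix(S) for S in subsets]
weights = kernel_weights([0.70, 0.60])  # hypothetical per-kernel AUCs
K = combine(kernels, weights)
```

The combined matrix `K` would then be passed to a kernel SVM solver that accepts a precomputed kernel.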

2. Testing Phase
The proposed method is evaluated by measuring the classification accuracy on the testing data. The main steps of the testing phase are as follows:

Step 1: Split the testing dataset into smaller datasets based on feature clusters
In the first step, the testing set is divided into smaller subsets using the feature clusters obtained in the training phase.

Step 2: Constructing training-testing kernel matrices
The kernel matrices of training-testing data are calculated in this step using training and testing subsets.

Step 3: Weighted linear combination of training-testing kernel matrices
The training-testing kernels computed in the previous step are combined using the weights calculated in the training phase by Equation (2).

Step 4: Classification of testing data using kernel-based Support Vector Machine
Finally, by using the combined training-testing kernel matrix, the testing data are classified by the trained Support Vector Machine.
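The testing steps can be sketched in the same spirit. The code below is an illustrative assumption rather than the authors' implementation: it builds one Gaussian test-versus-train kernel per feature cluster, combines them with the weights found during training, and evaluates the usual kernel SVM decision function. The names `clusters`, `weights`, `sigmas`, `dual_coef`, and `bias` stand for quantities that would come from the training phase; their values here are hypothetical.

```python
import math


def take_columns(X, cols):
    """Keep only the feature columns that belong to one cluster."""
    return [[row[j] for j in cols] for row in X]


def cross_kernel(test_rows, train_rows, sigma):
    """Gaussian kernel between every test sample and every training sample."""
    return [[math.exp(-math.dist(t, r) ** 2 / (2 * sigma ** 2))
             for r in train_rows] for t in test_rows]


def combined_test_kernel(X_test, X_train, clusters, weights, sigmas):
    """Weighted sum of per-cluster test-vs-train kernels (N_test x N_train)."""
    n_test, n_train = len(X_test), len(X_train)
    K = [[0.0] * n_train for _ in range(n_test)]
    for cols, w, s in zip(clusters, weights, sigmas):
        Kc = cross_kernel(take_columns(X_test, cols),
                          take_columns(X_train, cols), s)
        for i in range(n_test):
            for j in range(n_train):
                K[i][j] += w * Kc[i][j]
    return K


def decision_values(K_test, dual_coef, bias):
    """Kernel SVM decision function: f(x) = sum_j coef_j * k(x, x_j) + b."""
    return [sum(c * k for c, k in zip(dual_coef, row)) + bias for row in K_test]


X_train = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
X_test = [[0.0, 0.0, 1.0]]
clusters = [[0, 1], [2]]   # feature indices per cluster, fixed in training
weights = [0.5, 0.5]       # kernel weights, fixed in training
sigmas = [1.0, 1.0]        # kernel widths, fixed in training
K_test = combined_test_kernel(X_test, X_train, clusters, weights, sigmas)
scores = decision_values(K_test, dual_coef=[-1.0, 1.0], bias=0.0)
```

The sign of each score gives the predicted class, and the scores themselves can be used to compute the AUC.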

EXPERIMENTAL RESULTS
In this section, experiments are conducted to evaluate the performance of the proposed algorithm using the TCGA dataset. Then, the proposed method is compared with several baseline methods.

1. TCGA Dataset
In the experiments, several groups of cancers available in the TCGA dataset were used to detect the cancer stage. In this dataset, gene expression values of cancer patients, covering more than 10,000 tumors, are available. In the experiments, HTSeq-FPKM records of primary tumors were downloaded and used for each disease group.
The TCGA database includes clinical annotations for cancer patients. One of the annotated items is the degree of cancer progression, which is a number between 1 and 4 for each patient.
Because it is clinically significant to distinguish between early and late stages of cancer, in this paper primary tumors annotated with stage 1 are considered early stage and the remaining tumors, annotated with stages 2, 3, and 4, are considered late stage. The disease group information used in this paper is summarized in Table 1.
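The labeling rule above amounts to a simple mapping from the annotated stage to a binary class. A minimal sketch, assuming the stage is already available as an integer from 1 to 4 (the function name and the sample annotations are hypothetical):

```python
def binarize_stage(stage):
    """Map a numeric stage (1-4) to 'early' (stage 1) or 'late' (stages 2-4)."""
    if stage not in (1, 2, 3, 4):
        raise ValueError(f"unexpected stage: {stage!r}")
    return "early" if stage == 1 else "late"


stages = [1, 3, 2, 1, 4]  # hypothetical clinical annotations
labels = [binarize_stage(s) for s in stages]
```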

2. Experiment Settings
For each cancer group, 80% of the tumors were selected as training data and the remaining 20% as testing data. The data was divided so that the proportions of positive and negative classes are almost the same in the training and testing sets.
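A split of this kind is usually called a stratified split. The following sketch shows one way to obtain it, assuming that each class is shuffled and cut separately; the function name, the seed, and the class counts are hypothetical and not taken from the paper.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Cut each class separately so both sets keep a similar class ratio.
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


labels = ["early"] * 10 + ["late"] * 30  # hypothetical class counts
train_idx, test_idx = stratified_split(labels)
```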
The range of gene expression values is large. After adding a fixed value, the gene expression values were converted to a more limited range using log2. The training set was normalized to have zero mean and a standard deviation of one, and the testing set was then normalized as well.
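The preprocessing described above can be sketched as follows. Several details are assumptions made for illustration only: the fixed offset is taken to be 1.0, the sample standard deviation is used, and the testing set is scaled with the mean and standard deviation estimated on the training set, which is a common convention but is not stated explicitly in the text.

```python
import math


def log2_transform(X, offset=1.0):
    """Apply log2(x + offset) to every value; the offset value is an assumption."""
    return [[math.log2(v + offset) for v in row] for row in X]


def fit_scaler(X_train):
    """Per-feature mean and standard deviation estimated on the training set."""
    n = len(X_train)
    means = [sum(col) / n for col in zip(*X_train)]
    stds = []
    for col, m in zip(zip(*X_train), means):
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        stds.append(sd if sd > 0 else 1.0)  # guard against constant features
    return means, stds


def apply_scaler(X, means, stds):
    """Standardize X with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


X_train = log2_transform([[0.0, 7.0], [3.0, 7.0], [15.0, 7.0]])
X_test = log2_transform([[1.0, 7.0]])
means, stds = fit_scaler(X_train)
Z_train = apply_scaler(X_train, means, stds)
Z_test = apply_scaler(X_test, means, stds)  # reuses the training statistics
```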
The performance of the proposed algorithm is compared with Support Vector Machine, Random Forest, a combination of PCA and Support Vector Machine, Deep Neural Network, and Multiple Kernel Learning using the Hallmark gene set collection, which includes 50 gene sets [18]. This collection was extracted from several molecular databases. Each gene set contains information about a specific biological state or biological process. Rahimi and Gönen [19] divided the gene expression dataset into 50 smaller ones based on the features available in the Hallmark gene sets. To implement Random Forest, the randomForestSRC package was used [28]. The number of trees for this algorithm was selected from the set {500, 1000, 1500, 2000, 2500} using 4-fold cross validation.
The code shared by Ma et al. [29] was used to implement the deep neural network.
To implement Support Vector Machine and Multiple Kernel Learning using the Hallmark gene sets, the code shared by Rahimi and Gönen [18] was used, and the MOSEK package was used to solve the quadratic optimization problems.
To compute kernel matrices, the Gaussian kernel function was used:

$$k(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right) \qquad (3)$$

where $\sigma$, the kernel width parameter, was set to the average Euclidean distance between all pairs of training data.
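The width selection rule can be illustrated with a short sketch. It assumes the common Gaussian form with $2\sigma^2$ in the denominator and averages the distance over distinct pairs of training samples only; both choices are assumptions, and the function names and data are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_distance(X):
    """Average Euclidean distance over all distinct pairs of rows."""
    dists = [math.dist(a, b) for a, b in combinations(X, 2)]
    return sum(dists) / len(dists)


def gaussian_kernel(X, sigma):
    """N x N Gaussian kernel matrix with width sigma."""
    return [[math.exp(-math.dist(a, b) ** 2 / (2 * sigma ** 2)) for b in X]
            for a in X]


X_train = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]  # toy training samples
sigma = mean_pairwise_distance(X_train)         # distances are 5, 10, and 5
K = gaussian_kernel(X_train, sigma)
```

Tying the width to the data scale in this way avoids a separate search over the kernel width for every feature cluster.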
In the proposed algorithm, the regularization parameter C was set to 1. Moreover, as discussed in Section 3, since genomic data are highly correlated [18, 19], the number of clusters should equal the ratio of the number of features in the dataset to the square root of the number of training samples [27]. In this way, the features are clustered such that the average number of features in each cluster equals the square root of the number of training samples, as proposed by Zhang et al. [16].
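As a worked example of this rule, the sketch below computes the number of clusters for a hypothetical dataset. Rounding to the nearest integer and enforcing at least one cluster are assumptions added for illustration; the feature and sample counts are not taken from the paper.

```python
import math


def number_of_clusters(n_features, n_train_samples):
    """c = d / sqrt(N), rounded to the nearest integer and at least 1."""
    return max(1, round(n_features / math.sqrt(n_train_samples)))


# Hypothetical sizes: 20,000 genes and 400 training samples.
c = number_of_clusters(n_features=20000, n_train_samples=400)
avg_features_per_cluster = 20000 / c
```

With these numbers the square root of the sample size is 20, so the features are grouped into 1000 clusters of 20 features on average.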
To compare classification performance of the mentioned algorithms, the evaluation measurement AUC (area under the ROC curve) has been calculated.
To achieve more reliable results, all the experiments were repeated 100 times, and the average of the AUC values was reported (Table 2).
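The evaluation measure can be illustrated with a small sketch that computes the AUC from its rank-based definition and averages it over repeated runs. The function names, the scores, and the encoding of the positive class as 1 are assumptions for illustration; the paper does not state which stage is treated as positive.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(runs):
    """Average the AUC over repeated experiments."""
    return sum(auc(s, y) for s, y in runs) / len(runs)


# Two hypothetical repetitions: (classifier scores, true labels).
runs = [([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),
        ([0.9, 0.2, 0.6, 0.1], [1, 1, 0, 0])]
avg = mean_auc(runs)
```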
Also, the results of the experiments are illustrated and compared in Figure 4. As Figure 4 shows, the average performance of the algorithms on all datasets is better than the random case (in which the AUC equals 0.5). Therefore, the gene expression dataset carries significant information about the stages of cancers.
Comparing the classification accuracy of the PCA+SVM algorithm with that of SVM shows that PCA+SVM achieved better results in 8 of 15 cancer groups, while SVM was better in only two groups. The greatest performance improvement of PCA+SVM was in the READ cancer group (6%), and the greatest performance reduction was in the STAD cancer group (2%).
Comparing the classification accuracy of the FCMKL algorithm with that of RF shows that FCMKL achieved better results in all 15 cancer groups. The performance improvement was significant in all groups. For example, compared to RF, FCMKL improved the classification performance of BRCA by 10%, HNSC by 26%, KICH by 17%, PAAD and COAD by 13%, READ by 8%, STAD by 5%, and TGCT by 9%.
By comparing the classification accuracy of FCMKL algorithm with SVM, it is observed that FCMKL has

The proposed algorithm is implemented in the R language. The experiments were conducted on a Windows 10 system with a Core i7 CPU with 8 cores and 16 GB of RAM. The training time of the proposed algorithm ranges from 15 minutes to one hour and 10 minutes for different cancer groups.

CONCLUSION
Genomic data are useful in many medical applications including disease diagnosis, prevention, and treatment. Cancer is one of the most dangerous and life-threatening diseases in the world and is considered one of the most important causes of death. It is vital to detect the stage of cancer in a patient because if the disease is detected at an early stage, it is more likely to be curable. Also, the type of treatment differs between stages of the disease.
In this paper, an algorithm, FCMKL, is proposed to improve cancer stage detection using feature clustering based Multiple Kernel Learning. Because genomic data have a very high dimension, classification faces the curse of dimensionality. To address this problem, the features of the original dataset are first clustered based on the samples. Then, using the feature clusters, the original high-dimensional dataset is divided into smaller datasets in terms of the number of features. For each of these smaller datasets, a kernel matrix is computed. The kernel matrices are weighted and linearly combined. Finally, using the resulting kernel matrix, the Support Vector Machine is trained to determine the cancer stage. The experiments indicate promising performance of the proposed algorithm.
Employing other clustering algorithms may reduce the number of clusters, and with fewer clusters the computation time will decrease. Also, there are other genomic data types, such as microRNA and DNA methylation, which were not used in the proposed method. Using multimodal data may increase the classification accuracy.
Figure 1 block diagram steps: (1) Feature clustering; (2) Split the original dataset into smaller subsets using feature clusters; (3) Construct a separate kernel matrix for each data subset; (4) Weighted linear combination of kernels; (5) Train the classifier using the combined kernel.

Figure 2 illustrates three different ways to compute the kernel matrix for high-dimensional gene expression data. (a) In this case, a kernel function is simply applied to compute the kernel matrix. Since the dimension of gene expression data is high, the curse of dimensionality will reduce classification accuracy in this method. (b) To address the curse of dimensionality, it is recommended to reduce the dimension of the data by employing dimension reduction algorithms like PCA before computing the kernel. (c) Another approach to reducing the dimension of gene expression data is to cluster the features. This method, which is used in this paper, does not change or remove features. Suppose the dataset contains N data samples and the feature dimension of each data sample is d. The features are clustered into c clusters. Therefore, each cluster contains N data samples which are d/c-dimensional on average. For each d/c-dimensional cluster, a separate N × N kernel is computed.

Figure 5. The number of cluster members in the FCMKL algorithm for each cancer group. The black dots represent the clusters and the red dots represent the average number of cluster members in each group. The violin plot shows the range and distribution of the clusters in terms of the number of their members.

TABLE 1. Summary of 15 cancer groups in the TCGA dataset

TABLE 3. The number of clusters computed in the FCMKL algorithm for each cancer group