Title: Diagnostic Classification of Lung Cancer Using Deep Transfer Learning Technology and Multi‐Omics Data
Chinese Journal of Electronics, Volume 30, Issue 5, pp. 843-852. Original Research Paper.

ZHU Rong ([email protected]), DAI Lingyun ([email protected]), and LIU Jinxing ([email protected]): School of Computer Science, Qufu Normal University, Rizhao 276826, China. GUO Ying ([email protected]): School of Automation, Central South University, Changsha 410083, China. Corresponding authors.

First published: 01 September 2021. https://doi.org/10.1049/cje.2021.06.006

This work is supported in part by Shandong Social Science Planning Fund Program No. 21BTQJ02, and the National Natural Science Foundation of China under Grants No. 61902215 and No. 61872220.
Abstract: In recent years, with the increasing application of high-throughput sequencing technology, researchers have obtained and accumulated large amounts of multi-omics data, making it possible to diagnose cancer at the gene expression level. The proliferation of omics data provides a wealth of biological information, which brings both new opportunities and great challenges to cancer classification and diagnosis. Machine learning algorithms for the early diagnosis of lung cancer have emerged that distinguish early-stage from late-stage cancers using genomic features. Omics data are generally characterized by small sample size, high dimensionality, and high noise; simply applying common classification methods directly therefore cannot achieve good performance, and they must be improved in a targeted manner. This paper puts forward a combined convolutional neural network and convolutional auto-encoder approach to construct a deep transfer learning classification model for early lung cancer diagnosis. First, the convolutional auto-encoder algorithm is used to reduce the dimensionality of the dataset so that it better meets the requirements of transfer learning. Second, a neural network model is constructed with the original dataset and the existing labeled dataset, and the model transfer rules are set as well.
Finally, a small number of labeled target datasets are used in training to complete the construction of the classification model. The proposed convolutional neural network method based on model transfer and five other popular machine learning models are used to classify and predict three lung cancer gene datasets and an integrated dataset. The experimental results show that the proposed method obtains better prediction performance on four evaluation metrics, namely accuracy, precision, recall, and f1-score, and the average area-under-curve result also shows that the proposed method is optimal.

I. Introduction

Lung cancer[1,2] is currently one of the malignant tumors with the highest mortality rate in the world. It ranks first among causes of death from malignant tumors in men, and second in women, behind only breast cancer. One important reason lung cancer is so harmful is that its onset is latent and its early symptoms are difficult to detect. By the time obvious clinical symptoms appear, most patients have already entered the middle or advanced stages, when treatment is of limited effect, the cure rate is greatly reduced, and the prognosis is poor. Therefore, early detection and diagnosis of lung cancer can greatly improve treatment outcomes. Lung cancer is generally divided into two categories: small cell lung cancer and non-small cell lung cancer. Non-small cell lung cancer is further divided into four types, of which Lung adenocarcinoma (LUAD) and Lung squamous cell carcinoma (LUSC) are the two most common[3]. In the past few decades, with humanity's unremitting efforts to fight cancer, the survival rate of cancer patients has greatly improved, but the survival rate for lung cancer remains relatively low[4,5]. Lung cancer has become one of the cancers with the highest mortality rate, especially LUSC. Early diagnosis of lung cancer is therefore all the more important.
With the development of sequencing technology, large volumes of omics data have emerged, which has brought opportunities for bioinformatics but also challenges for traditional machine learning methods. High-throughput sequencing is also known as Deep sequencing (DS) or Next generation sequencing (NGS). NGS refers to the analysis of the arrangement of base sequences in DNA fragments through a certain technical method. NGS is marked by the ability to sequence hundreds of thousands to millions of DNA molecules in parallel at a time, with generally short read lengths, and it is a milestone in the development of cancer research. Over more than 40 years, sequencing has developed from the first generation to the third generation. This technology not only greatly reduces the cost of DNA sequencing, but also realizes the vision of rapid, efficient, and comprehensive analysis of the whole human genome sequence. With the continuous popularization of computer information management systems in medical institutions and the increasing digitalization of medical equipment, more patient-related information can be collected and recorded for subsequent treatment and scientific research. However, due to the lack of professional data integration and analysis capabilities, medical data is difficult to fully utilize. Applying computer technology to find useful knowledge in large amounts of medical data, so as to better serve medicine itself, is a topic worthy of research. The diagnosis and classification of cancer, a hot topic in the medical field, is a focus for scholars. In recent years, scientists have carried out large-scale tumor sequencing projects such as The cancer genome atlas (TCGA)[6,7] and obtained unprecedented amounts of cancer genome data.
Using computational methods and intelligent data mining tools to decipher these massive data and identify the underlying laws will have a positive effect on understanding the pathogenesis of cancer and designing effective drugs for cancer treatment, making it possible for cancer patients to be discovered in time and to obtain effective treatment. In recent research work, some classification methods based on machine learning have been used to identify differentially expressed genes in gene expression data for the early diagnosis of lung cancer. Paul et al.[8] used multiple rule sets of genetic programming for classification and proposed a majority voting genetic programming classifier. Draminski et al.[9] proposed Monte Carlo feature selection to select the features to include for a given supervised classification task in the analysis of microarray gene expression data. Broet et al.[10] proposed a statistical scoring method based on microarray data to identify gene expression characteristics that separate the early stage from the late stage. Huang et al.[11] proposed the use of deep transfer convolutional neural networks and extreme learning machines to diagnose lung nodules on CT images. Koike et al.[12] used machine learning-based models to predict the recurrence of peripheral lung squamous cell carcinoma. Hu Chen et al.[13] connected an autoencoder and a deconvolution network for low-dose CT imaging. Eraslan et al.[14] used a deep count autoencoder to denoise single-cell RNA sequences. Xu et al.[15] used a Stacked sparse autoencoder (SSAE) for nuclear detection in histopathological images of breast cancer. Kun-Hsing et al.[16] used machine learning methods and microscopic pathological image features to predict the prognosis of non-small cell lung cancer patients.
Wang Xiangxue et al.[17] proposed a computational histomorphometric image classifier that extracts nuclear features from digital H&E images to predict the early recurrence of NSCLC. ZHOU Tao et al.[18] studied the residual neural network and its application to medical image processing. Li Jiangyun et al.[19] proposed an improved Faster Region-based Convolutional neural network (Faster R-CNN) to detect and locate polyps, and even to achieve multi-object polyp detection in the future. Danaee et al.[20] proposed the use of autoencoders combined with gene expression profile data to extract key feature genes. The occurrence of cancer is related not only to genes but also to other regulatory mechanisms. The rational use of multiple kinds of omics information to predict cancer helps obtain more comprehensive information, thereby improving classification accuracy. However, omics data generally suffer from few samples, high dimensionality, and high noise. Because of these characteristics, simply applying ordinary classification methods directly cannot achieve good performance, and targeted improvements must be made. In this research, we combine a Convolutional neural network (CNN) and Convolutional auto-encoders (CAE) to propose a two-stage adaptive domain transfer training and classification method, which we call CC2DT. First, to make the dataset better meet the requirements of transfer learning, the CAE algorithm is used for dimensionality reduction. Second, a neural network model is constructed using the original dataset and the existing labeled dataset, and the model transfer rules are set. Finally, a small number of labeled target datasets are used for training to complete the construction of the classification model. We compared the CC2DT method with current state-of-the-art classification algorithms, and CC2DT is optimal on all final results.

II. Methods

Early diagnosis of lung cancer is very significant for the development of new prevention and treatment strategies. Standard machine learning algorithms can be used to distinguish early- and late-stage cancers based on genomic features. At present, some existing machine learning algorithms have achieved satisfactory predictive performance in the classification and diagnosis of lung cancer, but due to the high correlation of genomic data, their knowledge extraction capabilities are still greatly limited. Therefore, it is necessary to study more effective computational methods to improve the predictive performance of lung cancer diagnosis. To diagnose and classify lung cancer using deep learning methods and multi-omics data, the diagnosis of lung cancer is regarded as a binary classification task. We implemented all the model methods in Python. The framework of the CC2DT method is shown in Fig.1.

Fig. 1 The framework of CC2DT method

In this algorithm, the dataset is first split into two parts: the training dataset and the test dataset. In the second step, for the training dataset, a classification model is constructed using CNN and CAE technology. The third step is to use the grid search method to find the best hyperparameters to optimize the model. The fourth step is to evaluate the predictive performance of the final model.

1. CNN

The convolutional neural network is a multilayer perceptron with a special structure. In a convolutional layer, each node is connected only to several neurons adjacent to it and has no connection to other units, thus forming multiple two-dimensional planes. It is a locally connected network, and this connection process is convolution. Connected nodes share the same weights, so the number of parameters of the convolutional neural network model is reduced exponentially and the model is simplified at the same time.
The core component of the neural network is the layer, a data processing module that can be thought of as a data filter: some data goes in, and the data that comes out is more useful. Most deep learning consists of chaining together simple layers to achieve progressive data distillation. A deep learning model is like a sieve for data processing, comprising a series of increasingly refined data filters (that is, layers). A CNN[21] is usually composed of an input layer, convolutional layers, pooling layers, fully connected layers, and a softmax layer.

1) Convolutional layer

The convolutional layer is the core part of the CNN. Its function is to perform convolution operations on the input to extract feature maps that can be used by the next layer. The convolutional layers extract the features of the input data layer by layer from bottom to top. In a convolutional neural network, each convolutional layer may contain multiple feature maps, which are obtained by convolution operations and activation functions. The convolution operation in a convolutional neural network can be regarded as the inner product between the input sample and the convolution kernel. In each convolutional layer, the input data from the previous layer is convolved, the convolution result is passed through the activation function to obtain the corresponding feature map, and the obtained features are then passed to the next layer as its input, on which the next layer continues the convolution operation. Each convolutional layer can have one or more different convolution kernels, and each convolution kernel corresponds to a feature. In a CNN, a convolutional layer can contain many convolutional surfaces.
For example, if the input data is a matrix x of size M × N, the convolution kernel is a matrix w of size m × n, and the offset is b, then each element of the convolution result can be calculated with the following formula:

h = J(x * w + b)   (1)

where "*" represents the convolution operation and J(·) represents the activation function. The most commonly used activation function in CNNs is the Rectified linear unit (ReLU), which we also employ here. The ReLU layer applies the activation function of the neuron, which transforms the input data linearly or non-linearly. The formula of ReLU is as follows:

ReLU(x) = max(0, x)   (2)

2) Pooling layer

The pooling layer is also called the under-sampling or downsampling layer. The purpose of the pooling operation is to reduce the feature dimension, compress the amount of data and number of parameters, and avoid overfitting. Simply put, the role of the pooling layer is to reduce the amount of data to be processed while retaining useful information. In the pooling layer, a region is generally selected, and a new feature vector is generated from the feature vector of this region; this process is called a pooling operation. Under normal circumstances, pooling effectively reduces the dimension of the resulting feature vector, so that subsequent calculation is simpler and more robust. In the traditional CNN, after the convolutional layers and pooling (usually max pooling), the result is usually flattened and fed into two or more fully connected layers, and the softmax function is then used to compute the probability of each category for the sample; the predicted category is the one with the maximum probability. However, multiple fully connected layers have many parameters, which leads to a large overall network, slow calculation, complex parameter updates, low efficiency, and even overfitting.
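As a concrete illustration of the convolution and ReLU formulas in Eqs.(1) and (2), the following minimal sketch computes one valid-mode convolution with a ReLU activation. The 3×3 input, 2×2 kernel, and offset are hypothetical values chosen for illustration, not the paper's actual data or implementation:

```python
def relu(v):
    # Eq.(2): ReLU(x) = max(0, x)
    return max(0.0, v)

def conv2d_valid(x, w, b):
    """Eq.(1): h = J(x * w + b) with J = ReLU.
    x: M x N input matrix, w: m x n kernel, b: scalar offset.
    Returns the (M-m+1) x (N-n+1) feature map."""
    M, N = len(x), len(x[0])
    m, n = len(w), len(w[0])
    out = []
    for i in range(M - m + 1):
        row = []
        for j in range(N - n + 1):
            # inner product of the kernel with the input window at (i, j)
            s = sum(x[i + u][j + v] * w[u][v]
                    for u in range(m) for v in range(n))
            row.append(relu(s + b))
        out.append(row)
    return out

x = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0],
     [2.0, 0.0, 1.0]]
w = [[1.0, 0.0],
     [0.0, 1.0]]
h = conv2d_valid(x, w, b=-1.0)  # [[1.0, 4.0], [0.0, 1.0]]
```

Note how the negative pre-activation at position (1, 0) is clipped to 0 by ReLU, which is exactly the nonlinearity Eq.(2) describes.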
Global pooling makes the sliding window size of the pooling equal to the size of the feature map, so that each feature map outputs one value. The advantage of global pooling over a fully connected layer is that it needs no parameters: it only takes the average or maximum of the entire feature map, which effectively reduces training time. Moreover, global pooling does not need its parameters adjusted by the optimization algorithm, which helps avoid overfitting. Besides, global pooling aggregates spatial information, so it is more robust to spatial transformations of the input. Global pooling comes in two types: Global average pooling (GAP) and Global max pooling (GMP). GMP extracts only the most salient area in each feature map, while GAP considers every area in the feature map and takes the average.

3) Fully connected layer

After multiple layers of convolution and pooling, the abstracted high-level features can be classified through the fully connected layer. The calculation formula is as follows:

O = softmax(W * X + b)   (3)

where W represents the weight matrix of the fully connected layer, X represents the feature vector after convolution and pooling, and b represents the bias of the fully connected layer. After the matrix product is calculated, the softmax function limits the value of each element of the result to between 0 and 1, with their sum equal to 1.

4) Weight adjustment

The earliest neural network was the perceptron proposed by Rosenblatt[22] and others, which can automatically obtain a combination of samples based on training samples. The perceptron can automatically determine its parameters through training, using a supervised learning method.
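The global pooling variants and the softmax of Eq.(3) described above can be sketched as follows. The 2×2 feature map is a hypothetical toy example, not output from the paper's model:

```python
import math

def global_avg_pool(fmap):
    # GAP: average over the whole feature map -> one value per map
    vals = [v for row in fmap for v in row]
    return sum(vals) / len(vals)

def global_max_pool(fmap):
    # GMP: keep only the most salient activation in the map
    return max(v for row in fmap for v in row)

def softmax(z):
    # Eq.(3): squashes scores into (0, 1) so they sum to 1;
    # subtracting max(z) is the usual numerical-stability trick
    e = [math.exp(v - max(z)) for v in z]
    s = sum(e)
    return [v / s for v in e]

fmap = [[1.0, 3.0],
        [0.0, 2.0]]
gap = global_avg_pool(fmap)   # 1.5
gmp = global_max_pool(fmap)   # 3.0
probs = softmax([gap, gmp])   # two "class scores" -> probabilities
```

Because GMP keeps only the peak activation while GAP averages all of them, GMP ≥ GAP always holds for the same map, as the toy values show.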
The training dataset must have known output results, and the model is then trained based on these outputs: the connection weights between the input layer and the output layer are obtained through error-correction learning, adjusting the weights according to the difference between the actual output and the expected output on the given training samples. Generally, the perceptron initializes its parameters with random numbers, so the parameters obtained by training may differ each time. Although the perceptron can automatically obtain parameters, it can only solve simple linearly separable problems and cannot solve linearly inseparable, complex problems. Therefore, to solve complex nonlinear problems, researchers proposed the multilayer perceptron model. A multilayer perceptron is a combination of multiple perceptrons; because its network propagates forward, it is sometimes called a feedforward network. The layers are connected by weight values. Early multilayer perceptrons also used random numbers to determine the connection weights between the input layer and the intermediate layer, and then adjusted the connection weights according to the error between the expected output and the actual output of the input data; however, error-correction learning was applied only to the connection weights between the middle layer and the output layer. This sometimes results in different input data producing the same output, so the correct output cannot be obtained. To solve this problem, the error Backpropagation algorithm (BP)[23,24] appeared. In the learning process of the BP algorithm, the signal propagates forward and the error propagates backward. Generally, input data is passed forward through the layers to the output layer to obtain the actual output result.
If the actual output does not match the expected output, the error between the two is calculated and the process of error backpropagation begins: the error propagates backward from the output layer to obtain the error of each layer, and the connection weights between the layers are finally adjusted to reduce the error. Forward propagation and backpropagation are repeated, and the weights of each layer are continuously adjusted until the output error reaches a minimum, finally yielding an optimal output result. Weight adjustment generally uses the Gradient descent method (GDM): the adjustment of a connection weight is determined by computing the error between the expected output and the actual output together with the related gradient, producing a new connection weight; the weights are then continuously adjusted to minimize the error and thereby obtain the optimal connection weights. The error between the expected output and the actual output is generally measured with the least-squares error function.

2. CAE

The AutoEncoder (AE) is a typical unsupervised neural network model whose target output equals its input. After training, the autoencoder can copy the input to the output with a certain accuracy. The autoencoder automatically learns the features in the data and tries to copy the input information to the output; however, there is information loss during the encoding process, so the input information can only be copied approximately. The autoencoder is an algorithm for data compression: its encoding and decoding processes are the processes of data compression and decompression. In deep learning, the features generated by the autoencoder network are often used to replace the original data to achieve better results. The autoencoder is a simple three-layer neural network, namely an input layer, a hidden layer, and an output layer.
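The gradient-descent weight adjustment with a least-squares error described above can be sketched on a single weight. This is an illustrative stand-in, not the paper's network: the model y_hat = w·x, the data, and the learning rate are all hypothetical choices:

```python
def gradient_descent_fit(xs, ys, lr=0.1, epochs=100):
    """GDM sketch on one weight: model y_hat = w * x,
    least-squares error E = sum (y - w*x)^2,
    gradient dE/dw = -2 * sum x * (y - w*x)."""
    w = 0.0  # random init in practice; fixed here for reproducibility
    for _ in range(epochs):
        grad = -2.0 * sum(x * (y - w * x) for x, y in zip(xs, ys))
        w -= lr * grad / len(xs)  # step against the gradient
    return w

# Data generated by y = 3x; gradient descent should recover w close to 3.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]
w = gradient_descent_fit(xs, ys)
```

Each iteration moves w opposite to the error gradient, so the least-squares error shrinks until the weight converges, which is the same loop that backpropagation drives layer by layer in a deep network.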
Taking the hidden layer as the boundary, the encoder is on the left and the decoder on the right. In the training process, the input is encoded and then decoded; that is, the input is compressed into features, and the features are then restored into the input. The hidden layer is the core of the entire autoencoder, and its number of neurons is much lower than that of the input layer, which is equivalent to expressing the input data with fewer features, thereby achieving the purpose of reducing the data dimension. The autoencoder uses the encoder to encode the input dataset and then reconstructs the input samples through the decoder, making the output value approximate the input value as closely as possible. Building on the unsupervised learning method of the traditional autoencoder, CAE incorporates the convolution and pooling operations of the convolutional neural network to achieve feature extraction, which preserves the information in the signal well and at the same time effectively improves the training speed. The main function of CAE is to apply convolutional encoding and decoding to the input data: convolution and down-sampling encoding are used to discover the features, and reverse decoding is then used to restore the data. The convolutional autoencoder requires that the relevant parameters be learnable during training. In a deep-learning autoencoder, the encoder and decoder usually perform convolution and deconvolution operations, the error between the data before encoding and after decoding is calculated, and backpropagation is used to continuously adjust the parameter values during training until the error reaches a minimum. CAE thus combines the convolution and pooling operations of CNN with the traditional unsupervised learning of the autoencoder to realize effective feature extraction[14].
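The encode-compress-decode-reconstruct cycle just described can be sketched with a minimal dense (non-convolutional) autoencoder pass. The hand-picked weights, identity activation, and two-dimensional input are purely illustrative assumptions, not the paper's CAE:

```python
def encode(x, W, b):
    # encoder: h = J(Wx + b), here with J = identity for simplicity
    return [sum(Wij * xj for Wij, xj in zip(Wi, x)) + bi
            for Wi, bi in zip(W, b)]

def decode(h, Wp, bp):
    # decoder: x_hat = J(W'h + b')
    return [sum(Wij * hj for Wij, hj in zip(Wi, h)) + bi
            for Wi, bi in zip(Wp, bp)]

def recon_error(x, x_hat):
    # reconstruction error: squared distance ||x - x_hat||^2
    return sum((a - c) ** 2 for a, c in zip(x, x_hat))

# n = 2 input features compressed to m = 1 hidden feature and back.
x = [1.0, 2.0]
W, b = [[0.6, 0.8]], [0.0]           # encoder weights (1 x 2)
Wp, bp = [[0.6], [0.8]], [0.0, 0.0]  # decoder weights (2 x 1)
h = encode(x, W, b)
x_hat = decode(h, Wp, bp)
err = recon_error(x, x_hat)
```

Because the single hidden unit cannot carry all of the input's information, the reconstruction is only approximate and err is nonzero; training a real CAE means driving this error down by gradient descent on W, b, W', and b'.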
The autoencoder is a three-layer feedforward neural network for representation learning[25]. CAE is composed of an input layer, a hidden layer, and an output layer; it learns the feature expression of the original data by minimizing the reconstruction error[26]. Through the two operating steps of encoding and decoding, CAE minimizes the reconstruction error of the input data, thereby obtaining the best hidden-layer expression of the data. The encoder f(x) maps the input data x ∈ R^n to the hidden-layer feature h ∈ R^m, expressed as follows:

h = f(x) = J(Wx + b)   (4)

where W ∈ R^{m×n} represents the weight matrix of the encoder, b ∈ R^m denotes the offset vector of the encoder, and J denotes the activation function of the encoder. In contrast, the decoder g(h) maps the hidden-layer feature h back to the input space to obtain the reconstruction result x̂, expressed as follows:

x̂ = g(h) = J(W′h + b′)   (5)

where W′ ∈ R^{n×m} represents the weight matrix of the decoder, b′ ∈ R^n represents the offset vector of the decoder, and J represents the decoder activation function. The reconstruction error S(x, x̂) represents the difference between the input data x and the reconstruction result x̂, expressed as follows:

S(x, x̂) = ∥x − x̂∥²   (6)

The above objective can be minimized by methods such as gradient descent, thus realizing the construction of the CAE.

III. Experimental Results

The classification and analysis of cancer using omics data mainly started with single-omics data. Methods that use single-omics data to classify cancer mainly focus on applying gene expression data to diagnose and type cancer, which has long been one of the research hotspots in bioinformatics. Restricted by microarray experiments, gene expression datasets generally have the characteristics of small samples, high dimensions, high noise, and uneven sample distribution.
For this reason, compared with traditional classification problems, it is more difficult to use gene expression data for classification. At present, cancer classification based on single omics mainly proceeds by studying the selection of characteristic genes and the design of classifiers. In the experiment, the classification method combining CAE and CNN was tested and compared with the classification results of the Support vector machine (SVM), Random forest (RF), Linear discriminant analysis (LDA), Extra trees (ET), and Multi-layer perceptron (MLP) classification methods. The classification test adopts a 10-fold cross-validation method. During the test, a grid search algorithm is used to select the optimal parameters, and the test results of the different methods are compared. Because the training and test sets are small and the multi-omics data are high-dimensional, the Gradient descent method (GDM) is used to fine-tune the weights of the model, which effectively prevents overfitting during training and may also improve classification accuracy.

1. Data collection

Three different types of omics data of LUSC were used in the experiment: mRNA expression, miRNA-seq data, and DNA methylation data. The experimental data come from TCGA and have been processed and provided by Baoshan Ma et al.[27]. A brief description of the experimental datasets is given in Table 1.

Table 1. Statistics of experimental datasets
Data Sets                  Size
LUSC methy                 362 × 16693
LUSC mirna                 362 × 449
LUSC mrna                  362 × 16760
Integrate multiple data    418 × 19877

2. Evaluation metrics

We evaluate the performance of the CC2DT method with a 10-fold cross-validation (10-fold-cv) algorithm. In the algorithm, we first use the grid search method to select the best hyperparameters to optimize the model.
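The combination of grid search with 10-fold cross-validation just mentioned can be sketched as follows. The `dummy_score` function is a purely hypothetical stand-in for training and evaluating the real model on each fold, and the `lr` grid is an illustrative assumption:

```python
from itertools import product

def kfold_indices(n, k=10):
    """Split n sample indices into k disjoint folds (sketch of k-fold CV).
    Yields (train_indices, test_indices) for each of the k rounds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def grid_search(param_grid, score_fn, n, k=10):
    """Pick the hyperparameter combination with the best mean CV score.
    score_fn(params, train, test) -> score stands in for fitting on
    the train fold and evaluating on the test fold."""
    best_params, best_score = None, float("-inf")
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        scores = [score_fn(params, tr, te)
                  for tr, te in kfold_indices(n, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

# Hypothetical scorer that happens to prefer lr = 0.01 (illustration only).
def dummy_score(params, train, test):
    return 1.0 - abs(params["lr"] - 0.01)

best, score = grid_search({"lr": [0.1, 0.01, 0.001]}, dummy_score, n=50, k=10)
```

In practice, the scoring call would train the CC2DT model on the nine training folds and evaluate it on the held-out fold; the hyperparameter combination with the best mean score across all ten folds is the one kept.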
Then, the model and the parameters that achieve the best performance are fitted to the training set and evaluated on the test set. In this article, we use accuracy, precision, recall, and f1-score to evaluate the effect of the algorithm. The calculation formulas for these four evaluation metrics are expressed as follows:

accuracy = (TP + TN) / (TP + FN + FP + TN)   (7)

precision = TP / (TP + FP)   (8)

recall = TP / (TP + FN)   (9)

f1-score = (2 × precision × recall) / (precision + recall)   (10)

where TP is the number of samples that are actually positive and are correctly classified as positive by the classifier; TN is the number of samples that are actually negative and are correctly classified as negative; FP is the number of samples that are actually negative but are incorrectly classified as positive; and FN is the number of samples that are actually positive but are incorrectly classified as negative. The higher the accuracy, precision, recall, and f1-score, the better the classification performance.

3.
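The four metrics in Eqs.(7)-(10) can be computed directly from the confusion-matrix counts. The toy labels below are hypothetical, chosen only to exercise all four counts:

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and f1-score,
    Eqs.(7)-(10), from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy example: 6 of 8 samples correct, with one FP and one FN.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
```

Here TP = 3, TN = 3, FP = 1, and FN = 1, so all four metrics evaluate to 0.75, matching Eqs.(7)-(10) term by term.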