Diagnosis of Gastric Cancer via Classification of the Tongue Images using Deep Convolutional Networks
References
[1] B. R. Bistrian, "Modern Nutrition in Health and Disease (Tenth Edition)," Crit. Care Med., vol. 34, no. 9, p. 2514, Sep. 2006, doi: 10.1097/01.CCM.0000236502.51400.9F.
[2] R. A. Smith et al., “American Cancer Society Guidelines for the Early Detection of Cancer,” CA. Cancer J. Clin., vol. 52, no. 1, pp. 8–22, Jan. 2002, doi: 10.3322/canjclin.52.1.8.
[3] G. Murphy, R. Pfeiffer, M. C. Camargo, and C. S. Rabkin, "Meta-analysis Shows That Prevalence of Epstein–Barr Virus-Positive Gastric Cancer Differs Based on Sex and Anatomic Location," Gastroenterology, vol. 137, no. 3, pp. 824–833, 2009, doi: 10.1053/j.gastro.2009.05.001.
[4] F. Azizi, H. Hatami, and M. Janghorbani, "Epidemiology and Control of Common Diseases in Iran," 3rd ed. Tehran: Khosravi Publications, pp. 45–47, 2010.
[5] P. Bertuccio et al., “Recent patterns in gastric cancer: A global overview,” International Journal of Cancer, vol. 125, no. 3. Wiley-Liss Inc., pp. 666–673, Aug. 01, 2009, doi: 10.1002/ijc.24290.
[6] A. Baranovsky and M. H. Myers, “Cancer Incidence and Survival in Patients 65 Years of Age and Older,” CA. Cancer J. Clin., vol. 36, no. 1, pp. 26–41, Jan. 1986, doi: 10.3322/canjclin.36.1.26.
[7] J. D. Emerson and G. A. Colditz, “ Use of Statistical Analysis in The New England Journal of Medicine ,” N. Engl. J. Med., vol. 309, no. 12, pp. 709–713, Sep. 1983, doi: 10.1056/nejm198309223091206.
[8] J. Hasanzadeh et al., "Gender differences in esophagus, stomach, colon and rectum cancers in Fars, Iran, during 2009–2010: an epidemiological population based study," J. Rafsanjan Univ. Med. Sci., vol. 12, no. 5, pp. 333–342, 2013.
[9] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-Based R-CNNs for Fine-Grained Category Detection," in Computer Vision – ECCV 2014, 2014, pp. 834–849.
[10] H. H. Hartgrink, E. P. M. Jansen, N. C. T. van Grieken, and C. J. H. van de Velde, "Gastric cancer," Lancet, vol. 374, no. 9688, pp. 477–490, 2009, doi: 10.1016/S0140-6736(09)60617-6.
[11] S. Sarebanha, A. Hooman Kazemi, P. Sadrolsadat, and N. Xin, "Comparison of Traditional Chinese Medicine and Traditional Iranian Medicine in Diagnostic Aspect," Mar. 2016. Accessed: Nov. 19, 2020. [Online]. Available: http://jtim.tums.ac.ir.
[12] J. Hu, S. Han, Y. Chen, and Z. Ji, “Variations of Tongue Coating Microbiota in Patients with Gastric Cancer,” Biomed Res. Int., vol. 2015, 2015, doi: 10.1155/2015/173729.
[13] T. Ma, C. Tan, H. Zhang, M. Wang, W. Ding, and S. Li, “Bridging the gap between traditional Chinese medicine and systems biology: The connection of Cold Syndrome and NEI network,” Molecular BioSystems, vol. 6, no. 4. The Royal Society of Chemistry, pp. 613–619, Mar. 17, 2010, doi: 10.1039/b914024g.
[14] X. Liu et al., “The Metabonomic Studies of Tongue Coating in H. pylori Positive Chronic Gastritis Patients,” Evidence-Based Complement. Altern. Med., vol. 2015, p. 804085, 2015, doi: 10.1155/2015/804085.
[15] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
[16] X. Liu et al., “The Metabonomic Studies of Tongue Coating in H. pylori Positive Chronic Gastritis Patients,” Evidence-based Complement. Altern. Med., vol. 2015, 2015, doi: 10.1155/2015/804085.
[17] C.-C. Chiu, "A novel approach based on computerized image analysis for traditional Chinese medical diagnosis of the tongue," Comput. Methods Programs Biomed., vol. 61, no. 2, pp. 77–89, 2000, doi: 10.1016/S0169-2607(99)00031-0.
[18] S. Branson et al., "Visual Recognition with Humans in the Loop," in Computer Vision – ECCV 2010, 2010, pp. 438–451.
[19] M. Moghimi, “Using color for object recognition,” Calif. Inst. Technol. Tech. Rep, 2011.
[20] E. Gavves, B. Fernando, C. G. M. Snoek, A. W. M. Smeulders, and T. Tuytelaars, “Fine-Grained Categorization by Alignments,” pp. 1713–1720, 2013, doi: 10.1109/ICCV.2013.215.
[21] X. Zhang, H. Xiong, W. Zhou, and Q. Tian, “Fused One-vs-All Features with Semantic Alignments for Fine-Grained Visual Categorization,” IEEE Trans. Image Process., vol. 25, no. 2, pp. 878–892, Feb. 2016, doi: 10.1109/TIP.2015.2509425.
[22] B. Yao, A. Khosla, and L. Fei-Fei, “Combining randomization and discrimination for fine-grained image categorization,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011, pp. 1577–1584, doi: 10.1109/CVPR.2011.5995368.
[23] B. Yao, G. Bradski, and L. Fei-Fei, “A codebook-free and annotation-free approach for fine-grained image categorization,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 3466–3473, doi: 10.1109/CVPR.2012.6248088.
[24] N. Zhang, R. Farrell, and T. Darrell, “Pose pooling kernels for sub-category recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 3665–3672, doi: 10.1109/CVPR.2012.6248364.
[25] Y. J. Lee, A. A. Efros, and M. Hebert, “Style-aware mid-level representation for discovering visual connections in space and time,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 1857–1864.
[26] T. Berg and P. N. Belhumeur, “Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 955–962.
[27] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2014, pp. 806–813.
[28] N. Zhang, R. Farrell, F. Iandola, and T. Darrell, “Deformable part descriptors for fine-grained recognition and attribute prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 729–736.
[29] J. Donahue et al., “Decaf: A deep convolutional activation feature for generic visual recognition,” in International conference on machine learning, 2014, pp. 647–655.
Elham Gholami, Department of Computer Engineering, Neyshabur Branch, Islamic Azad University, Neyshabur, Iran, gholami.elh@gmail.com
Seyed Reza Kamel Tabbakh*, Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran, rezakamel@computer.org
Maryam Kheirabadi, Department of Computer Engineering, Neyshabur Branch, Islamic Azad University, Neyshabur, Iran, maryam.abadi@gmail.com
Received: 09/Sep/2020    Revised: 01/May/2021    Accepted: 04/Jun/2021
Abstract
Gastric cancer is among the most common cancers worldwide and is responsible for a large number of deaths. One of the main issues regarding this disease is the absence of early and accurate detection. In clinical practice, gastric cancer is diagnosed through numerous tests and imaging procedures, which are costly and time-consuming, so physicians are seeking a cost-effective and time-efficient alternative. One such alternative is traditional Chinese medicine, in which disease is diagnosed by observing changes of the tongue: the appearance and the color of various regions of the tongue are among its key diagnostic components. In this study, a method is presented that localizes the tongue surface regardless of the pose of the person in the image. If the face components, especially the mouth, are localized correctly, the components that yield the greatest discrimination in the dataset can be used, which is favorable in terms of time and space complexity. Moreover, given the best localization estimate, the best features can be extracted relative to those components, and the best possible accuracy can be achieved. Feature extraction in this study is performed with deep convolutional neural networks, and the random forest algorithm is used to train the proposed model and evaluate the criteria. Experimental results show that the average classification accuracy reaches approximately 73.78%, which demonstrates the superiority of the proposed method compared to other methods.
Keywords: Gastric Cancer; Deep Convolutional Networks; Image Classification; Fine-grained Recognition.
1- Introduction
Cancer is the second leading cause of death worldwide after cardiovascular diseases [1]. Gastric cancer has a high mortality rate [2]; it is the fourth most prevalent cancer and the second most fatal one [3, 4]. The incidence of gastric cancer has, however, been decreasing, especially in developed countries [5, 6]. In Iran, unlike most developed countries, the incidence of gastric cancer is on the rise, particularly in the north and northwest of the country [8].
Traditional Chinese medicine (TCM) is a natural and comprehensive healthcare system dating back 3,000 to 5,000 years [13]. TCM has historically been used for the treatment of various diseases in East Asia and is regarded as a complementary and alternative medical system in Western countries [15]. In TCM, diseases are diagnosed based on information obtained through observing, hearing, smelling, and touching, and most diagnoses rely on pulse measurement and tongue examination [18].
In Chinese medicine, the tongue is examined chiefly because it is located in the mouth and is therefore little affected by external and environmental factors [19]. Recently, several studies have investigated gastric cancer diagnosis using images of the color and texture of the tongue [14, 16, 18]. Recognizing objects, textures, and components in an image is a central problem in image processing; its main challenge is the diversity of natural images, which results from differences in objects and cameras, illumination variations, movement, deformation, and background clutter. In this study, we adopt an approach based on recent developments in deep learning for the visual recognition of tongue images, and we propose a new method based on deep convolutional networks and random forests for fine-grained image classification, which could also be applied to areas other than the analysis of tongue texture images.
2- Literature Review
Significant progress has recently been made in image classification, which over the past decade has become a commercial, applied problem rather than purely a research subject. In the present study, we first review basic classification methods and the progress achieved. Fine-grained classification has also recently attracted the attention of researchers. A human recognizes a chair, for instance, by recognizing its components, such as the legs and the back; the ability to recognize objects at the part level is associated with the ability to discriminate between similar objects. This observation inspired our proposed method as well as other approaches.
In a study in this regard, Branson et al. assessed object classification using a semi-automatic method in which the user was asked questions about the object in a given set of images, and the species of a bird was recognized based on the user's responses; the accuracy of the applied method was reported to be 19% [20]. Welinder et al. performed automatic classification using color histogram features and a KNN classifier, reporting low accuracy [21]. To increase accuracy, Moghimi extracted features from the area where the probability of the presence of the object was high; in that study the object region was selected manually, ultimately yielding an accuracy of 18.9% [22].
Zhang et al. proposed a method to match parts precisely and to extract features of lower dimension. First, candidate regions were selected using the selective search algorithm; an SVM classifier was then used to detect the regions with the maximum score, which resulted in an accuracy of 82.8% [24]. In other work, machine learning was applied to localize the components and extract features, and favorable results were obtained on the CUB-200-2011 dataset.
The methods used in the aforementioned studies can be grouped into three categories. The first category comprises early methods with classification accuracies of 10-30%, in which conventional classification techniques were applied to fine-grained classification without remarkable results [17, 23, 25, 26]. The second category contains methods designed specifically for the fine-grained classification problem, with reported accuracies of 40-60%; their relatively low accuracy may be due to the use of non-deep features. The third category contains methods based on deep learning, which address this problem with accuracies of 80-90% [12, 37, 41]. The method proposed in this study belongs to the third category.
3- The Proposed Method
We propose a method to locate the tongue surface independently of the various poses of people in the images. If the face components (especially the mouth) are located correctly, the components that yield the greatest discrimination in the dataset can be used, which is favorable in terms of time and space complexity. Since we have the best estimate of the component locations, the best features can be extracted relative to those components, and the best possible accuracy can be achieved. Each initial image contained the tongue together with the face and head, and the dimensions of the raw images were 2988×5312 pixels. Figure 1 shows an example of a raw (initial) image from the healthy category.
Fig. 1. Example of Initial Image
As mentioned earlier, the problem of fine-grained image classification in computer vision has been addressed with deep convolutional networks, which are highly diverse yet structurally very similar to one another. In the present study, we used the AlexNet architecture, which consists of eight main layers. Figure 2 depicts a schematic view of the selected network.
Fig. 2. AlexNet Network Structure
Notably, we take a different view of neural networks in this study. Fully connected networks are often viewed as sequences of one-dimensional layers, whereas in convolutional neural networks each layer is treated as a three-dimensional volume of activations. Training is the major challenge in deep convolutional networks: AlexNet has roughly 60 million parameters, so training is difficult and prone to over-fitting. In addition, the forward pass, which computes the values of all the network nodes layer by layer from the input, and the backward pass, which computes the errors for network learning, are time-consuming. Fine-tuning is aimed at applying deep learning methods to small databases.
Fine-tuning uses transfer learning with the new database used for training. In this approach, a network pre-trained on a database with many more images (e.g., ILSVRC) provides the initial weights for another network that is identical to the target network except for the final probability layer. Because the target network differs from the source network only in the number of outputs of this layer, the weights of the probability layer must be re-estimated, while the layers before it can be initialized with the learned weights of the corresponding layers in the source network. Learning is then carried out on the data of the target database.
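As an illustration only (the paper does not specify an implementation), the following minimal PyTorch sketch shows this kind of fine-tuning: an ImageNet-pretrained AlexNet is loaded, the final probability layer is replaced to match the new number of outputs, and the remaining layers start from the transferred weights. The names num_classes and train_loader are assumptions, not part of the original method.

    import torch
    import torch.nn as nn
    from torchvision import models

    num_classes = 2  # assumption: healthy vs. unhealthy tongue images

    model = models.alexnet(pretrained=True)             # weights pre-trained on ILSVRC
    model.classifier[6] = nn.Linear(4096, num_classes)  # re-estimate only the output (probability) layer

    # Freeze the transferred convolutional layers so that, initially, only the
    # classifier weights are updated on the small target database.
    for param in model.features.parameters():
        param.requires_grad = False

    optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                                lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    def finetune_epoch(train_loader):
        model.train()
        for images, labels in train_loader:   # images resized to the 227x227 network input
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()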
Our proposed method was divided into training and test steps, which have been further discussed below.
3-1- Training Step
The training step had three main stages, as follows:
1. Random forest training: The location of each component in an image was represented by a rectangle, which was used for training. The random forest model learned to distinguish the pixels inside the rectangle from those outside it.
The training samples required for random forest training were pixels belonging to a component and pixels not belonging to it. Each rectangle could be considered a component; examples are the peripheral rectangle of the individual's mouth and the peripheral rectangle of the tongue (Figure 3).
Fig. 3. Different Examples of Random Forest Training (In the image on the right, the red rectangle is the peripheral rectangle of the individual's tongue; the red dots in the red rectangle belong to the component, and the blue dots do not belong to the component; in the middle image, the yellow dots belong to the component; in the image on the left, the green rectangle is the peripheral rectangle of the individual's face, and the positive points inside belong to the component)
The training data required by the random forest were of the form

A = {(f_i, y_i) | i = 1, …, N},  y_i ∈ {+1, −1},

where f_i is the deep feature vector of the i-th sampled pixel of the database images; y_i = +1 indicates that the pixel belongs to the component of interest, and y_i = −1 indicates that it does not. In Figure 3, the blue dots are the negative points (y_i = −1), and the red dots are the positive points (y_i = +1).
The algorithm used to generate training set A proceeded in several steps. For each training image, the peripheral rectangle of the component was obtained, its region annotation was prepared, and the deep features of each pixel were computed. An arbitrary number of pixels (20 in our setting) were sampled inside the peripheral rectangle, randomly or uniformly; each pixel inside the tongue region was added to set A with its deep features as a positive sample. Likewise, an arbitrary number of pixels (100 in our setting) were sampled over the entire image, randomly or uniformly, and each pixel falling outside the peripheral rectangle was added to set A with its deep features as a negative sample.
After training set A was complete, the model was trained with the random forest. Because only one forward pass of the neural network is needed and the random forest is small (10 decision trees with a maximum depth of 10), the likelihood of membership for all pixels can be estimated rapidly.
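A minimal sketch of this stage is given below. It assumes a helper pixel_features(image) that returns the H×W×D array of per-pixel deep features described in the next stage and ground-truth peripheral rectangles given as (x0, y0, x1, y1); the sampling counts (20 positives, 100 candidate negatives) and the forest size (10 trees, depth 10) follow the description above, while everything else is an illustrative assumption.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def build_training_set(images, rectangles, pixel_features, rng=None):
        rng = rng or np.random.default_rng(0)
        X, y = [], []
        for img, (x0, y0, x1, y1) in zip(images, rectangles):
            feats = pixel_features(img)                      # H x W x D per-pixel deep features
            h, w = feats.shape[:2]
            for _ in range(20):                              # positive samples inside the rectangle
                r, c = rng.integers(y0, y1), rng.integers(x0, x1)
                X.append(feats[r, c]); y.append(1)
            for _ in range(100):                             # candidate pixels over the whole image
                r, c = rng.integers(0, h), rng.integers(0, w)
                if not (y0 <= r < y1 and x0 <= c < x1):      # keep only pixels outside the rectangle
                    X.append(feats[r, c]); y.append(0)
        return np.asarray(X), np.asarray(y)

    def train_component_forest(images, rectangles, pixel_features):
        X, y = build_training_set(images, rectangles, pixel_features)
        # 10 decision trees with a maximum depth of 10, as described above
        forest = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0)
        forest.fit(X, y)
        return forest

    def membership_map(forest, image, pixel_features):
        feats = pixel_features(image)
        h, w, d = feats.shape
        prob = forest.predict_proba(feats.reshape(-1, d))[:, 1]   # per-pixel membership likelihood
        return prob.reshape(h, w)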
2. Fine-tuning the deep convolutional neural networks (DCNNs): Three DCNNs were used for part-wise feature extraction and were fine-tuned on the entire image, the mouth image, and the tongue image, respectively. The features of the layers of these networks were used so that one feature vector could be generated per pixel. To calculate the deep pixel features of an image or image section, the image was fed to the network and one forward pass was computed. Following the forward pass, the feature values of the input image were available in all the layers; these are referred to as the feature channels (Figure 4). As shown in Figure 4, each set of feature channels has a specific size, which was changed by up-sampling. Notably, one limitation of this method is the fixed input size of the network: the input image must be 227×227 pixels for the network employed here.
After resizing the feature channels to the input image size, all the feature channels were stacked to constitute a per-pixel column of feature channels with a length of 1,376 (a sketch of this computation follows Figure 4).
Fig. 4. Calculation of Feature Channels in Different Layers (After giving an arbitrary image to the input, all the feature channels of the middle layers or the data values in the middle layer were calculated by forward pass; the size of the channels was changed to the size of the input (up-sample); 1,376 different channels were obtained; the total depth of layers conv1 to conv5 in the AlexNet network was 1,376 with the same dimension as the input image.)
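A sketch of how such per-pixel feature channels could be computed with a torchvision AlexNet is given below; the conv-layer indices and the use of bilinear up-sampling are implementation assumptions. Note that the original AlexNet channel widths (96 + 256 + 384 + 384 + 256) give the depth of 1,376 quoted above, whereas the torchvision variant used in the sketch yields 1,152 channels (64 + 192 + 384 + 256 + 256).

    import torch
    import torch.nn.functional as F
    from torchvision import models

    alexnet = models.alexnet(pretrained=True).eval()
    conv_indices = [0, 3, 6, 8, 10]              # conv1..conv5 inside alexnet.features

    def pixel_feature_channels(image_227):       # image_227: 1 x 3 x 227 x 227 tensor
        maps = []
        x = image_227
        with torch.no_grad():
            for i, layer in enumerate(alexnet.features):
                x = layer(x)
                if i in conv_indices:
                    # up-sample this layer's activations back to the input resolution
                    maps.append(F.interpolate(x, size=image_227.shape[-2:],
                                              mode='bilinear', align_corners=False))
        return torch.cat(maps, dim=1)            # 1 x D x 227 x 227 column of feature channels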
To fine-tune the mouth and tongue networks, the corresponding random forest was used to estimate the peripheral rectangle of each region in the training data; this image section was then cropped, and fine-tuning was performed on the crops. To extract the part-wise features, the following steps were taken:
· We selected the image section from which the features had to be extracted.
· The selected image section was resized to the input dimensions (227×227), and the resized image was fed to the network.
· The forward pass was computed to obtain the values of all the layers for this input.
· The values of the fc7 layer (4,096-dimensional) were returned as the feature.
The proposed method can thus generate a 4,096-dimensional feature vector for arbitrary segments of one or more images; this feature vector can be fed as input to classifiers such as an SVM.
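The steps above could be realized, for example, as in the following sketch, where net is one of the fine-tuned AlexNet networks and crop is a PIL image of the selected region (mouth, tongue, or the whole image); the normalization constants are the usual ImageNet values and are an assumption.

    import torch
    from torchvision import transforms

    to_input = transforms.Compose([
        transforms.Resize((227, 227)),           # resize the cropped region to the network input size
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def fc7_features(net, crop):
        net.eval()
        x = to_input(crop).unsqueeze(0)          # 1 x 3 x 227 x 227
        with torch.no_grad():
            x = net.features(x)
            x = net.avgpool(x)
            x = torch.flatten(x, 1)
            x = net.classifier[:6](x)            # stop after fc7 (4,096-dimensional)
        return x.squeeze(0).numpy()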
3. Classifier training: For the final classifier, the one-vs.-all scheme was used with an SVM with a linear kernel. The final feature vector of each image was obtained by concatenating the three part-wise deep feature vectors extracted from the individual's mouth, the tongue, and the entire image.
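A minimal sketch of this final stage is shown below, reusing the fc7_features helper above; net_full, net_mouth, and net_tongue stand for the three fine-tuned networks and are assumed names, and scikit-learn's LinearSVC serves as the linear, one-vs.-rest SVM.

    import numpy as np
    from sklearn.svm import LinearSVC

    def image_descriptor(full_img, mouth_crop, tongue_crop):
        # concatenate the three 4,096-dimensional part features (12,288 dimensions in total)
        return np.concatenate([
            fc7_features(net_full, full_img),
            fc7_features(net_mouth, mouth_crop),
            fc7_features(net_tongue, tongue_crop),
        ])

    def train_final_classifier(descriptors, labels):
        clf = LinearSVC(C=1.0)                   # linear kernel, one-vs.-rest by default
        clf.fit(np.vstack(descriptors), labels)
        return clf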
3-2- Test Step
In the test step, the component locations and features were determined for each test image, and the classifier was then used to estimate its class. The peripheral rectangles of the tongue, the mouth area, and the face were estimated for each test image using the random forest models. The fine-tuned neural networks were used to extract the three part-wise deep features, as in the training step. The final vector was obtained by concatenating these feature vectors, and the classifier labeled it as healthy or unhealthy.
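Putting the pieces together, the test step could look roughly like the sketch below. Here box_from_map (turning a membership map into a peripheral rectangle, e.g. by thresholding the probabilities) and crop are hypothetical helpers, and the forests, networks, pixel_features function, and classifier come from the training sketches above.

    def classify_tongue_image(img, mouth_forest, tongue_forest, pixel_features, clf):
        # estimate the peripheral rectangles with the random forests
        mouth_box = box_from_map(membership_map(mouth_forest, img, pixel_features))
        tongue_box = box_from_map(membership_map(tongue_forest, img, pixel_features))
        # extract and concatenate the three part-wise fc7 features, then classify
        descriptor = image_descriptor(img, crop(img, mouth_box), crop(img, tongue_box))
        return clf.predict(descriptor.reshape(1, -1))[0]   # healthy vs. unhealthy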
4- Results
The experimental results of the proposed method are discussed in this section. For the experiments, we used a database of tongue images containing 700 images of patients and 800 images of healthy subjects. For training, 500 healthy images and 470 patient images (970 images in total) were used; 300 healthy images and 330 patient images (630 images in total) were used for testing.
First, the evaluation metrics for classification on this database are introduced, and the proposed method is then assessed against them. There are limited choices for evaluating general and fine-grained image classification: unlike other computer vision tasks such as object detection, pose estimation, or segmentation, a classification is simply correct if the estimated class equals the true class and incorrect otherwise. For a more precise assessment, the classification accuracy metric is examined in detail below.
We assumed that the test data were given as ordered pairs (x_i, y_i), i = 1, …, N, in which x_i is the image, y_i is its label, and N is the number of test images. The mean accuracy used to evaluate the classifier f(x) was calculated as

mA = (1/C) Σ_{c=1}^{C} (1/|S_c|) Σ_{i ∈ S_c} II(f(x_i) = y_i)

In the formula above, C is the number of classes and S_c is the set of indices of the test images belonging to class c, defined as

S_c = { i | y_i = c, 1 ≤ i ≤ N }

Notably, |S_c| represents the number of test samples belonging to class c, and II(·) is the mathematical indicator function. The value of mA is the mean classification accuracy averaged over the classes.
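For reference, the mean per-class accuracy defined above can be computed as in the following sketch, where y_true and y_pred are the true and predicted labels of the N test images.

    import numpy as np

    def mean_class_accuracy(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        accs = []
        for c in np.unique(y_true):
            idx = (y_true == c)                  # S_c: indices of the test data belonging to class c
            accs.append(np.mean(y_pred[idx] == y_true[idx]))
        return float(np.mean(accs))              # average accuracy over the classes (mA)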
The proposed deep non-parametric transfer (DNPT) method was evaluated with various features in the neighbor-detection stage (Table 1); the feature space named inside the parentheses is the one used to find the neighbors. For example, DNPT(conv3) means that the proposed transfer method uses the conv3 features to detect the neighbors. DNPT(oracle) denotes the experiments in which the ground-truth locations of the components were used instead of estimated locations; if a method for exact localization of the components were available, the mean accuracy of DNPT(oracle) could be attained, so it provides an upper bound on the mean accuracy of the enhanced transfer method.
Table 1: Results of Non-parametric Transfer Method (Names specified with * represent the proposed methods)