Taxonomy Dimension Reduction for Colorectal Cancer Prediction

Colorectal cancer is one of the most common cancers, and a growing number of people suffer from it. It is essential to diagnose and treat the cancer as early as possible. The disease may change the microorganism communities in the gut, so gut microorganisms may offer an efficient way to predict colorectal cancer. In this study, we use operational taxonomic units (OTUs), which cover several kinds of microorganisms, to predict colorectal cancer. To find the most important microorganisms and obtain the best prediction performance, we explore effective feature selection methods in three main steps. First, we apply a single feature selection method to reduce the feature set. Next, to reduce the number of features further, we combine the dimension reduction methods correlation-based feature selection (CFS) and maximum relevance–maximum distance (MRMD 1.0 and MRMD 2.0). Finally, we select the important features according to the taxonomy files. We evaluate random forest, naïve Bayes, and decision tree classifiers.

  Flow Chart

        [Figure: flow chart]

  Dataset

        Two datasets are used in this study, both from Oudah et al. The first dataset (CRC1) contains 90 cancer samples and 92 normal samples, with an original feature set of 18,170 operational taxonomic units (OTUs). The second dataset (CRC2) contains 30 cancer samples and 30 normal samples, with 6,807 OTUs. Taxonomy files are also provided; they give the complete classification of each OTU (kingdom, phylum, class, order, family, genus, and species).
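
        As an illustration of how the taxonomy files could be put to use, the sketch below sums OTU abundances that share a taxon at a chosen rank. The layout of the files is not described here, so the dictionary structure, the function name collapse_to_rank, and the toy OTU ids and counts are all assumptions made for the example, and grouping by rank is only one possible way to select features "according to the taxonomy files".

```python
from collections import defaultdict

# Rank names as listed for the taxonomy files.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def collapse_to_rank(otu_counts, taxonomy, rank):
    """Sum OTU abundances that share the same taxon at the given rank.

    otu_counts: dict of OTU id -> abundance in one sample (assumed layout).
    taxonomy:   dict of OTU id -> dict of rank name -> taxon name (assumed layout).
    OTUs without a label at that rank are grouped under "unclassified".
    """
    if rank not in RANKS:
        raise ValueError(f"unknown rank: {rank}")
    totals = defaultdict(float)
    for otu_id, abundance in otu_counts.items():
        taxon = taxonomy.get(otu_id, {}).get(rank) or "unclassified"
        totals[taxon] += abundance
    return dict(totals)


# Toy data: the OTU ids and abundances are made up for illustration.
taxonomy = {
    "OTU_1": {"phylum": "Fusobacteria", "genus": "Fusobacterium"},
    "OTU_2": {"phylum": "Fusobacteria", "genus": "Fusobacterium"},
    "OTU_3": {"phylum": "Firmicutes", "genus": "Faecalibacterium"},
    "OTU_4": {"phylum": "Firmicutes"},  # no genus-level label
}
sample = {"OTU_1": 3, "OTU_2": 5, "OTU_3": 10, "OTU_4": 2}

by_genus = collapse_to_rank(sample, taxonomy, "genus")
by_phylum = collapse_to_rank(sample, taxonomy, "phylum")
```

        Collapsing to a coarser rank shrinks the feature space (four OTUs become two phyla in the toy data) at the cost of resolution, so the choice of rank is a trade-off rather than a fixed rule.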

  Usage

        MRMD 1.0: http://lab.malab.cn/soft/MRMD/index_en.html
        MRMD 2.0: https://github.com/heshida01/MRMD2.0
        CFS: You can find this method in WEKA, and the description is here
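
        For readers who want to see the idea behind CFS before opening WEKA, here is a minimal sketch of the CFS merit score, k * r_cf / sqrt(k + k * (k - 1) * r_ff), with a greedy forward search. It is a simplification under stated assumptions: it uses absolute Pearson correlation and a plain forward search, whereas WEKA's implementation may use a different correlation measure and search strategy. The function names and the toy data are hypothetical and are not part of WEKA or MRMD.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cfs_merit(columns, labels, subset):
    """CFS merit: rewards feature-class correlation, penalizes feature-feature correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[j], labels)) for j in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward(columns, labels, tol=1e-9):
    """Greedy forward search: add the best feature until the merit stops improving."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        scored = [(cfs_merit(columns, labels, selected + [j]), j) for j in remaining]
        score, j = max(scored, key=lambda t: t[0])
        if score <= best + tol:  # no real improvement, so stop
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best


# Toy data (made up): feature 1 duplicates feature 0, feature 2 is unrelated to the label.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1, 2, 1, 5, 6, 5],
    [1, 2, 1, 5, 6, 5],
    [3, 1, 2, 3, 1, 2],
]
selected, merit = cfs_forward(columns, labels)
```

        On the toy data the search keeps only feature 0: the duplicate adds no merit because it is fully redundant, and the unrelated feature lowers the score.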

  Download

        CRC1 dataset
        CRC2 dataset
        MRMD 1.0
        MRMD 2.0