mAML: an automated machine learning pipeline with a microbiome repository for human disease classification

Clone the repository from GitHub after installing git.


The Python version used is 3.7.3.

Install the required packages with `pip3 install -r requirements.txt`.

Alternatively, you can pull the Docker image and run the pipeline in a Docker container. Learn how to use Docker.


$ usage: --help or -h

A pipeline for automatically identify the best performing combinations of scalars and classifiers for microbiomic data

positional arguments:
  X_file                feature matrix file (required)
  Y_file                map file (required)

optional arguments:
  -h, --help            show this help message and exit
  --outdir OUTDIR, -o OUTDIR
                        path to store analysis results, default='./'
  --prevalence PREVALENCE, -p PREVALENCE
                        filter low within-class prevalence features, default= 0.2
  --mrmr_n MRMR_N       number of features selected with MRMR, default=0
  --over_sampling       {ADASYN,KMeansSMOTE,SMOTE,SVMSMOTE,BorderlineSMOTE,RandomOverSampler,SMOTENC,None}
                        over sampling method for the imbalance data (default: SMOTE)
  --search              tune parameters of each classifier while selecting the best scaler and classifier
  --outer_cv OUTER_CV   number of fold in the outer loop of nested cross validation default=10
  --inner_cv INNER_CV   number of fold in the inner loop of nested cross validation, default=5
  --scoring SCORING     one of ['accuracy', 'average_precision', 'f1', 'f1_micro', 'f1_macro', 'f1_weighted', 'f1_samples', 'neg_log_loss', 'precision', 'recall', 'roc_auc'], default='accuracy'
  --n_jobs N_JOBS, -j N_JOBS
                        number of jobs to run in parallel, default= 1
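
The `--search`, `--outer_cv`, `--inner_cv` and `--scoring` options describe a nested cross validation. The following is a minimal sketch of that idea with scikit-learn, not the code mAML itself runs: the inner loop tunes a hyperparameter with `GridSearchCV`, and the outer loop scores the tuned model on held-out folds. The synthetic data, the single-value grid for `clf__max_depth`, the choice of `StandardScaler` and `n_estimators=20` are assumptions made for illustration; the step names `scl` and `clf` follow the names shown in the pipeline's log output.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and its labels (illustration only).
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# One scaler/classifier combination; a full run would compare many of them.
pipe = Pipeline([
    ("scl", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Fold counts mirror the defaults of --inner_cv (5) and --outer_cv (10).
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Inner loop: choose max_depth by grid search on the training part of each outer fold.
search = GridSearchCV(
    pipe,
    param_grid={"clf__max_depth": [4, 16]},
    scoring="accuracy",
    cv=inner_cv,
)

# Outer loop: estimate how well the tuned pipeline generalises.
scores = cross_val_score(search, X, y, scoring="accuracy", cv=outer_cv)
print("mean outer-fold accuracy: %.2f" % scores.mean())
```

Because tuning happens only inside each outer training fold, the outer score is not biased by the parameter search.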

    python Gevers2014_IBD_ileum.csv --mrmr_n 20 --over_sampling  --outdir ./ 

    Gevers2014_IBD_ileum.csv is the feature file (X_file); the map file (Y_file) is given as the second positional argument.
    The `Datasets preview` column on the `Web Server` page shows what the format looks like.

Once the pipeline runs successfully, the following messages will appear on the screen:

2019-11-14 12:01:32 Start reading data from Gevers2014_IBD_ileum.csv          
2019-11-14 12:01:32 Finish loading data from Gevers2014_IBD_ileum.csv, dimension is (140, 99), label counts Counter({'CD': 78, 'no': 62})               
2019-11-14 12:01:32 Filtered the features with max within_class prevalence lower than 0.2, dimension is (140, 74)  
2019-11-14 12:01:34 Selected 50 features using mrmr            
2019-11-14 12:01:34 Dataset shape Counter({0: 78, 1: 62}) before over sampling    
2019-11-14 12:01:34 Over sampled dataset with SMOTE, shape Counter({1: 78, 0: 78})    
2019-11-14 12:01:34 Select the best tree-based classifiers: ['DecisionTreeClassifier', 'BaggingClassifier','GradientBoostingClassifier', 'AdaBoostClassifier', 'RandomForestClassifier', 'ExtraTreesClassifier', 'XGBClassifier', 'LGBMClassifier']
                and combination of scalers: ['Non', 'Binarizer', 'MinMaxScaler', 'MaxAbsScaler', 'StandardScaler', 'RobustScaler', 'PowerTransformer_YeoJohnson', 'QuantileTransformer_Normal', 'QuantileTransformer_Uniform', 'Normalizer', 'Log1p']     
                and classifiers: ['KNeighborsClassifier', 'GaussianNB', 'LogisticRegression', 'LinearSVC', 'SGDClassifier']   
                Tune each classifier with GridSearchCV     
2019-11-14 12:07:42 Best optimized classifier: RandomForestClassifier, Accuracy:0.74, 
                    Best Param:{'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=16, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None, oob_score=False, random_state=0, verbose=0, warm_start=False), 
                    clf__max_depth': 16, 'scl': NonScaler()}      
2019-11-14 12:07:43 Pipeline is finished
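
Two preprocessing steps reported in the log above, the within-class prevalence filter and over sampling, can be sketched in plain Python. This is an illustration under stated assumptions, not mAML's implementation: prevalence is taken to be the fraction of samples in a class with a non-zero value, and the over sampler shown is simple random duplication (comparable in spirit to the `RandomOverSampler` option) rather than the default SMOTE. The function names `filter_by_prevalence` and `random_over_sample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def filter_by_prevalence(rows, labels, threshold=0.2):
    """Return indices of features whose highest within-class prevalence
    is at least `threshold`.

    Assumption: prevalence = fraction of samples in a class with a
    non-zero value for the feature.
    """
    class_sizes = Counter(labels)
    kept = []
    for j in range(len(rows[0])):
        # Count, per class, the samples in which feature j is present.
        nonzero = Counter(lab for row, lab in zip(rows, labels) if row[j] != 0)
        max_prev = max(nonzero[c] / class_sizes[c] for c in class_sizes)
        if max_prev >= threshold:
            kept.append(j)
    return kept


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data: 5 samples, 3 features; the middle feature is absent everywhere.
X = [
    [1, 0, 0],
    [2, 0, 0],
    [0, 0, 3],
    [0, 0, 1],
    [4, 0, 0],
]
y = ["CD", "CD", "CD", "no", "no"]

kept = filter_by_prevalence(X, y, threshold=0.2)
X_bal, y_bal = random_over_sample(X, y)
print(kept)            # indices of the retained features
print(Counter(y_bal))  # class counts after over sampling
```

On the toy data the all-zero feature is dropped and the smaller class gains one duplicated sample, which mirrors the change in dimension and label counts that the log reports at each step.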