Clone the repository from GitHub after installing Git.
The pipeline requires Python 3.7.3.

Install the required packages with `pip3 install -r requirements.txt`.
Alternatively, you can pull the Docker image and run the pipeline in a Docker container (see the Docker documentation if you are new to Docker).
Run `python sklearn_pipeline.py --help` (or `-h`) to see the usage:

```
usage: sklearn_pipeline.py [options] X_file Y_file

A pipeline for automatically identifying the best-performing combinations of
scalers and classifiers for microbiome data

positional arguments:
  X_file                feature matrix file (required)
  Y_file                map file (required)

optional arguments:
  -h, --help            show this help message and exit
  --outdir OUTDIR, -o OUTDIR
                        path to store analysis results, default='./'
  --prevalence PREVALENCE, -p PREVALENCE
                        filter low within-class prevalence features, default=0.2
  --mrmr_n MRMR_N       number of features selected with mRMR, default=0
  --over_sampling {ADASYN,KMeansSMOTE,SMOTE,SVMSMOTE,BorderlineSMOTE,RandomOverSampler,SMOTENC,None}
                        over-sampling method for imbalanced data (default: SMOTE)
  --search              tune the parameters of each classifier while selecting
                        the best scaler and classifier
  --outer_cv OUTER_CV   number of folds in the outer loop of nested cross-validation, default=10
  --inner_cv INNER_CV   number of folds in the inner loop of nested cross-validation, default=5
  --scoring SCORING     one of ['accuracy', 'average_precision', 'f1', 'f1_micro', 'f1_macro', 'f1_weighted', 'f1_samples', 'neg_log_loss', 'precision', 'recall', 'roc_auc'], default='accuracy'
  --n_jobs N_JOBS, -j N_JOBS
                        number of jobs to run in parallel, default=1
```
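The scaler/classifier selection described above can be illustrated with a minimal scikit-learn sketch (not the project's actual code): put a placeholder scaler and classifier into a `Pipeline`, then let `GridSearchCV` try every combination. The step names `scl` and `clf` match those shown in the pipeline's own log output; the toy data and the two scalers/classifiers chosen here are assumptions for illustration.

```python
# Minimal sketch: select the best scaler/classifier combination by treating
# whole estimators as grid-search parameters of a Pipeline.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data standing in for the feature matrix (140 samples, as in the example run).
X, y = make_classification(n_samples=140, n_features=20, random_state=0)

pipe = Pipeline([("scl", StandardScaler()), ("clf", LogisticRegression())])
param_grid = {
    "scl": [MinMaxScaler(), StandardScaler()],
    "clf": [LogisticRegression(max_iter=1000), KNeighborsClassifier()],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```

The real pipeline searches a much larger grid of eleven scalers and many classifiers, as the log output below shows.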
Example:

```
python sklearn_pipeline.py Gevers2014_IBD_ileum.csv Gevers2014_IBD_ileum.mf.csv --mrmr_n 20 --over_sampling SMOTE --outdir ./
```
Notes:
- Gevers2014_IBD_ileum.csv is the feature file
- Gevers2014_IBD_ileum.mf.csv is the map file
- The `Datasets preview` column on the `Web Server` page shows what the format looks like.
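For orientation, here is a hypothetical miniature of the two input files, assuming the common layout of one sample per row: the feature file holds feature values per sample, and the map file holds each sample's class label. The file names, column names, and exact layout here are assumptions; the authoritative format is the `Datasets preview` on the `Web Server` page.

```python
# Build tiny stand-ins for the feature matrix (X_file) and map file (Y_file).
import pandas as pd

samples = ["sample1", "sample2", "sample3"]
features = pd.DataFrame(
    {"OTU_1": [0.2, 0.0, 0.5], "OTU_2": [0.1, 0.3, 0.0]},
    index=samples,
)
labels = pd.DataFrame({"label": ["CD", "no", "CD"]}, index=samples)

# Sample IDs go in the first column of both CSVs so the two files can be joined.
features.to_csv("X_demo.csv")
labels.to_csv("Y_demo.csv")

print(pd.read_csv("X_demo.csv", index_col=0).shape)
```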
Once the pipeline runs successfully, messages like the following will appear on the screen:
```
2019-11-14 12:01:32 Start reading data from Gevers2014_IBD_ileum.csv
2019-11-14 12:01:32 Finish loading data from Gevers2014_IBD_ileum.csv, dimension is (140, 99), label counts Counter({'CD': 78, 'no': 62})
2019-11-14 12:01:32 Filtered the features with max within_class prevalence lower than 0.2, dimension is (140, 74)
2019-11-14 12:01:34 Selected 50 features using mrmr
2019-11-14 12:01:34 Dataset shape Counter({0: 78, 1: 62}) before over sampling
2019-11-14 12:01:34 Over sampled dataset with SMOTE, shape Counter({1: 78, 0: 78})
2019-11-14 12:01:34 Select the best tree-based classifiers: ['DecisionTreeClassifier', 'BaggingClassifier', 'GradientBoostingClassifier', 'AdaBoostClassifier', 'RandomForestClassifier', 'ExtraTreesClassifier', 'XGBClassifier', 'LGBMClassifier']
and combination of scalers: ['Non', 'Binarizer', 'MinMaxScaler', 'MaxAbsScaler', 'StandardScaler', 'RobustScaler', 'PowerTransformer_YeoJohnson', 'QuantileTransformer_Normal', 'QuantileTransformer_Uniform', 'Normalizer', 'Log1p']
and classifiers: ['KNeighborsClassifier', 'GaussianNB', 'LogisticRegression', 'LinearSVC', 'SGDClassifier']
Tune each classifier with GridSearchCV
2019-11-14 12:07:42 Best optimized classifier: RandomForestClassifier, Accuracy:0.74,
Best Param:{'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=16, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None, oob_score=False, random_state=0, verbose=0, warm_start=False),
'clf__max_depth': 16, 'scl': NonScaler()}
2019-11-14 12:07:43 Pipeline is finished
```