mAML: an automated machine learning pipeline with a microbiome repository for human disease classification

OVERVIEW

The mAML webserver is a web-based machine learning (ML) system that automatically generates optimized and interpretable models for binary or multi-class classification tasks in microbiome studies. Once the input files are uploaded, the pipeline searches for the best combination of preprocessors and non-tree-based classifiers and simultaneously optimizes the hyperparameters of all classifiers. When the task is completed, the compressed results are automatically sent to the e-mail address the user specified. The user can also feed new data to an existing model, or upload a previously trained model, to make new predictions.
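Reusing a trained model on new data can be sketched as follows. This is only an illustration: the file format mAML uses for trained models is not stated here, so serialization with Python's `pickle` is an assumption, and the synthetic data and the `LogisticRegression` model are hypothetical stand-ins for a real feature matrix and a model produced by the pipeline.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a samples x features matrix (not real microbiome data).
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
X_train, y_train, X_new = X[:80], y[:80], X[80:]

# Train once, as the pipeline would after a completed task.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Serialize the trained model to bytes (assumed format: pickle).
saved = pickle.dumps(model)

# Later: restore the model and predict labels for previously unseen samples.
restored = pickle.loads(saved)
new_predictions = restored.predict(X_new)
print(new_predictions)
```

New data must have the same features, in the same order, as the matrix the model was trained on. Only unpickle model files that come from a trusted source.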

The pipeline includes four steps:

  1. Filter low-prevalence features.
  2. Select a feature subset using the distal_DBA/mRMR/HFE/univariate_feature_selection method.
  3. Perform over-sampling for imbalanced data using RandomOverSampler/ADASYN/SMOTE.
  4. Select the best-performing combination of data preprocessors and hyperparameter-optimized classifiers.
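
Steps 1 and 3 can be illustrated with a minimal, standard-library-only sketch. It is not mAML's implementation: the function names `filter_low_prevalence` and `random_over_sample`, the `min_prevalence` threshold, and the toy abundance values are all hypothetical, and the over-sampling shown is plain random duplication rather than ADASYN or SMOTE.

```python
import random
from collections import Counter


def filter_low_prevalence(samples, min_prevalence=0.1):
    """Keep features that are non-zero in at least min_prevalence of the samples."""
    n_samples = len(samples)
    n_features = len(samples[0])
    keep = [
        j for j in range(n_features)
        if sum(1 for row in samples if row[j] > 0) / n_samples >= min_prevalence
    ]
    return [[row[j] for j in keep] for row in samples], keep


def random_over_sample(samples, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(samples, labels) if label == cls]
        for _ in range(target - count):
            out_x.append(list(rng.choice(pool)))
            out_y.append(cls)
    return out_x, out_y


# Toy abundance table: 6 samples x 4 features (hypothetical values).
X = [
    [5, 0, 1, 0],
    [3, 0, 0, 0],
    [8, 0, 2, 0],
    [1, 0, 0, 7],
    [4, 0, 3, 0],
    [2, 0, 0, 0],
]
y = ["healthy", "healthy", "healthy", "healthy", "disease", "disease"]

X_filtered, kept = filter_low_prevalence(X, min_prevalence=0.3)
X_balanced, y_balanced = random_over_sample(X_filtered, y)

print(kept)                 # indices of the retained features
print(Counter(y_balanced))  # class counts after over-sampling
```

In a real analysis, over-sampling should be applied only to the training portion of the data, so that duplicated samples do not leak into the evaluation set.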

The pipeline is built on the Python machine learning package scikit-learn. Any built-in parameter of the 13 classifiers and 10 feature preprocessing methods can be edited on the web interface, and 11 different metrics are available for evaluating model performance. Because the pipeline is data-driven, it can also be used for other classification tasks, provided that a domain-specific feature matrix is supplied.
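
The idea of jointly searching over preprocessors, classifiers and their hyperparameters can be sketched with scikit-learn's `Pipeline` and `GridSearchCV`. This is a sketch under assumptions, not mAML's code: the two scalers, the two classifiers, the parameter grids, the three metrics and the synthetic data are chosen purely for illustration and do not represent the 13 classifiers, 10 preprocessing methods or 11 metrics offered on the web interface.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for a samples x features matrix (not real microbiome data).
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=5, random_state=0
)

# Placeholder steps; the grid below swaps in the actual candidates.
pipe = Pipeline([
    ("prep", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each dict pairs every preprocessor with one classifier and its own hyperparameters.
param_grid = [
    {
        "prep": [StandardScaler(), MinMaxScaler()],
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0, 10.0],
    },
    {
        "prep": [StandardScaler(), MinMaxScaler()],
        "clf": [KNeighborsClassifier()],
        "clf__n_neighbors": [3, 5, 7],
    },
]

# Score every combination on several metrics; refit the best one by ROC AUC.
search = GridSearchCV(
    pipe,
    param_grid,
    scoring={"accuracy": "accuracy", "f1": "f1", "roc_auc": "roc_auc"},
    refit="roc_auc",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With several metrics, `refit` names the one that decides which combination is kept as `search.best_estimator_`; the per-metric cross-validation scores for every combination remain available in `search.cv_results_`.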

Cite mAML

Fenglong Yang, Quan Zou*, mAML: an automated machine learning pipeline with a microbiome repository for human disease classification, Database, Volume 2020, 2020, baaa050, https://doi.org/10.1093/database/baaa050. (SCI, IF2018=3.683)