A Python-based toolkit for protein and peptide sequences: deep representation learning, embedding feature generation, and classification


1. Installation:


Run the nvidia-smi command to check whether the GPU driver is installed properly.

Environment: install Anaconda3 and PyTorch (version 1.2) first; an NVIDIA RTX 20XX- or 10XX-series GPU is required.

<1> conda create -n eFeature python=3.7
<2> conda activate eFeature
<3> conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch 
<4> pip install lightgbm==3.1.1 umap_learn==0.5.1 seaborn==0.11.1  tape_proteins==0.4 imbalanced_learn==0.8.0  joblib==1.0.1  rich==9.12.4 biovec 
<5> download eFeature.zip
unzip eFeature.zip
cd eFeature
python eFeature.py -h

eFeature is now configured.

2. Parameters:

--inTrain
    Protein or peptide sequences for TRAINING, in FASTA format.
    Header format: >sequence_id|label, where the id consists of digits or
    letters and the label is e.g. 0 for negative or 1 for positive.
    Example:
    >seq01|1
    MATQAAATTYVVV

--inTest
    Protein or peptide sequences for independent TESTING, in FASTA format.
    Same header format as --inTrain. Example:
    >seq01|1
    MATQAAATTYVVV

--out
    Output file name (CSV). The outputs are six deep representation
    learning feature sets, each in its own CSV file, plus a merged
    long-feature CSV file of the fused features.
    The 1st column of each CSV file is the sequence id,
    the 2nd column is the sequence label (0, 1, 2, ...),
    and the columns from the 3rd onward are the feature values
    (XX_F1, XX_F2, ..., where XX is the feature name).

--smote
    0: do not apply SMOTE; 1: apply SMOTE (Synthetic Minority
    Oversampling Technique) if the training dataset is imbalanced.

--mode
    mode=0: only generate features and do feature selection.
    mode=1: do feature extraction and selection, then do classification.
    With mode 0, users can simply input FASTA sequences to obtain
    sequence features; with mode 1, they can also generate features and
    run six default machine learning classifiers:
    KNN: K-Nearest Neighbors
    LR: Logistic Regression
    GNB: Gaussian Naive Bayes
    SVM: Support Vector Machine
    RF: Random Forest
    LGBM: Light Gradient Boosting Machine

--numclass
    The default value is 2 for binary classification; if the number of
    classes is not 2, specify the exact value. Both binary and
    multiclass classification are supported.

--kfold
    The default value is 5. Machine learning models are trained and
    evaluated via k-fold cross-validation.
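The ">sequence_id|label" header convention above can be parsed with a few lines of plain Python. This is an illustrative sketch, not eFeature's internal code; the function name is hypothetical.

```python
# Sketch (not part of eFeature) of parsing the ">sequence_id|label"
# FASTA convention described above.
def parse_labeled_fasta(text):
    """Return a list of (sequence_id, label, sequence) tuples."""
    records = []
    seq_id, label, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, label, "".join(chunks)))
            seq_id, label = line[1:].split("|", 1)  # e.g. "seq01|1"
            label = int(label)
            chunks = []
        else:
            chunks.append(line)  # sequences may span multiple lines
    if seq_id is not None:
        records.append((seq_id, label, "".join(chunks)))
    return records

example = """>seq01|1
MATQAAATTYVVV
>seq02|0
MKTAYIAKQR"""
print(parse_labeled_fasta(example))
```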


3. Usage Examples

(1) training dataset name: train_data.fasta
(2) test dataset name: test_data.fasta
(3) output file name: subloc.csv

eg.1 Convert the training data to eFeature values

 python eFeature.py --inTrain train_data.fasta --out subloc.csv

eg.2 Convert the training and independent testing data to eFeature values

 python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv

eg.3 Convert the training and independent testing data to eFeature values and apply SMOTE to the training dataset if it is imbalanced

 python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv --smote 1

eg.4 For an imbalanced binary classification problem, use:

  python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv --smote 1 --mode 1

eg.5 Use the --kfold parameter to set up the k-fold cross-validation:

 python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv --smote 1 --mode 1 --kfold 10

eg.6 For a multiclass (ten classes, for example) classification problem, use:

 python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv --mode 1 --numclass 10

4. Feature types extracted by the eFeature toolkit:

Abbreviation   Pretrained model
lm             Language Model
BiLSTM         Bidirectional Long Short-Term Memory
SSA            Soft Sequence Alignment
BERT           BERT-based model
UniRep         mLSTM
W2V            word2vec
FusedAll       All of the above features concatenated into one long vector
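The FusedAll row describes concatenating the six per-feature tables into one long vector per sequence. A minimal sketch of that fusion, joining per-feature rows on the sequence id, could look as follows; the function and variable names are illustrative, not eFeature's internals.

```python
# Hypothetical sketch of building a FusedAll-style long vector: each
# table has header [id, label, XX_F1, ...]; tables are joined on the
# sequence id and their feature columns are concatenated.
def fuse_feature_tables(tables):
    """tables: list of row-lists, each with header [id, label, feats...]."""
    fused = {}
    header = ["id", "label"]
    for rows in tables:
        head, body = rows[0], rows[1:]
        header.extend(head[2:])              # append this table's feature names
        for row in body:
            rec = fused.setdefault(row[0], [row[0], row[1]])
            rec.extend(row[2:])              # concatenate feature values
    return [header] + list(fused.values())

bert = [["id", "label", "BERT_F1"], ["seq01", "1", "0.2"]]
w2v  = [["id", "label", "W2V_F1"],  ["seq01", "1", "0.7"]]
print(fuse_feature_tables([bert, w2v]))
```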

5. Output data:

All output feature values and machine learning results are stored in the out folder under the current directory.

Example run:

python eFeature.py --inTrain SubLocTrain.fasta --inTest SubLocTest.fasta --out subloc.csv --mode 1 --smote 1 --numclass 10 --kfold 5

5.1 The out folder contains the following folders

5.2 The SSA folder contains the following folders and files

5.3 The machine learning results are stored in the Results folder

Folder or file name        Meaning
ValANDTest                 k-fold validation and independent testing results
ValANDTest_SelectedFeatures  k-fold validation and independent testing results after feature selection preprocessing
SMOTE_ValANDTest           SMOTE is applied first, then the k-fold validation and independent testing results are obtained
SMOTE_lgbmSF_ValANDTest    SMOTE is applied first, followed by light gradient boosting feature selection, then the k-fold validation and independent testing results are obtained

5.4 Take the ValANDTest folder as an example

Folder or file name               Meaning
Validation_MeanResults            Mean values of the k-fold validation metrics
Test_MeanResults                  Mean values of the independent testing metrics
EachFold_Validaiton_Test_Results  Validation and independent testing results for each k-fold-validated machine learning model
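The mean-result files above come from k-fold cross-validation: each fold is held out once for validation, a model is fit on the remaining folds, and the per-fold metrics are averaged. A plain-Python sketch of that scheme (assumed for illustration, not eFeature's code, with a toy majority-class "model"):

```python
import random
from statistics import mean

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def kfold_mean_accuracy(X, y, fit, predict, k=5):
    folds = kfold_indices(len(X), k)
    accs = []
    for i, val in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [j for m, f in enumerate(folds) if m != i for j in f]
        model = fit([X[j] for j in train], [y[j] for j in train])
        accs.append(mean(predict(model, X[j]) == y[j] for j in val))
    return mean(accs)  # this is the "mean result" over the k folds

# Toy usage: the "model" just predicts the majority training class.
X = list(range(10)); y = [0] * 6 + [1] * 4
fit = lambda Xs, ys: max(set(ys), key=ys.count)
predict = lambda model, x: model
print(kfold_mean_accuracy(X, y, fit, predict, k=5))
```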

6. More

6.1 UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction)

Once the eFeature values are generated, UMAP is run automatically and the results are saved to CSV files whose names contain "UMAP".

Run the script: python plot.py to obtain all UMAP data plots in the plotting folder.

6.2 lgbmSF (light gradient boosting machine feature selection)
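In lgbmSF-style selection, features are ranked by the importances of a trained gradient boosting model (e.g. LightGBM's feature_importances_) and the least informative ones are dropped. The cutoff rule below (keep features whose importance exceeds the mean importance) is an assumption for illustration, not eFeature's documented criterion.

```python
from statistics import mean

def select_by_importance(feature_names, importances):
    """Keep features whose boosting importance exceeds the mean.
    (Assumed cutoff rule; importances would come from a fitted
    LightGBM model via model.feature_importances_.)"""
    cutoff = mean(importances)
    return [n for n, imp in zip(feature_names, importances) if imp > cutoff]

names = ["BERT_F1", "BERT_F2", "W2V_F1", "W2V_F2"]
imps  = [120, 3, 45, 0]   # e.g. split gains from a fitted booster
print(select_by_importance(names, imps))  # -> ['BERT_F1', 'W2V_F1']
```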

6.3 SMOTE (Synthetic Minority Oversampling Technique)
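SMOTE balances an imbalanced training set by synthesizing new minority-class samples as interpolations between existing minority points. The toolkit installs imbalanced-learn for this; the minimal sketch below only illustrates the core interpolation idea (a production SMOTE interpolates toward k-nearest minority neighbors, which this sketch omits).

```python
import random
from collections import Counter

def smote_oversample(X, y, minority_label, seed=0):
    """Oversample the minority class up to the majority-class count by
    interpolating between pairs of minority samples. Simplified sketch:
    real SMOTE picks neighbors among the k nearest, not at random."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(counts.values()) - counts[minority_label]
    X_new, y_new = list(X), list(y)
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # interpolation factor in [0, 1)
        X_new.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
        y_new.append(minority_label)
    return X_new, y_new

X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0], [11.0, 11.0]]
y = [0, 0, 0, 1, 1]
Xr, yr = smote_oversample(X, y, minority_label=1)
print(Counter(yr))  # classes are now balanced
```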

References:

[1] Zhibin Lv, Feifei Cui, Quan Zou, Lichao Zhang and Lei Xu. Anticancer peptides prediction with deep representation learning features. Briefings in Bioinformatics, 2021. doi:10.1093/bib/bbab008

[2] Zhibin Lv, Pingping Wang, Quan Zou, Qinghua Jiang. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa1074


contact: lvzhibin at pku.edu.cn