A Python-based toolkit for protein and peptide sequences: deep representation learning, embedding feature generation, and classification


1. Installation:


Run the nvidia-smi command to check whether the GPU driver is installed properly.

Environment: install Anaconda3 and PyTorch (version 1.2) first; an NVIDIA RTX 20XX- or 10XX-series GPU is required.

<1> conda create -n eFeature python=3.7
<2> conda activate eFeature
<3> conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch 
<4> pip install lightgbm==3.1.1 umap_learn==0.5.1 seaborn==0.11.1  tape_proteins==0.4 imbalanced_learn==0.8.0  joblib==1.0.1  rich==9.12.4 biovec 
<5> download eFeature.zip
unzip eFeature.zip
cd eFeature
python eFeature.py -h

eFeature is now configured.

2. Parameters:

--inTrain
    Protein or peptide sequences for TRAINING, in FASTA format.
    Header format: >sequence_id|label, where the id consists of digits or
    letters and the label is e.g. 0 for negative or 1 for positive.
    Example:
    >seq01|1
    MATQAAATTYVVV

--inTest
    Protein or peptide sequences for independent TESTING, in FASTA format.
    Same header format as --inTrain. Example:
    >seq01|1
    MATQAAATTYVVV

--out
    Output file name (CSV). The outputs are six deep representation
    learning feature sets, each in its own CSV file, plus a merged
    long-feature CSV file of the fused features.
    The 1st column of each CSV file is the sequence id,
    the 2nd column is the sequence label (0, 1, 2, ...),
    and the columns from the 3rd onward are the feature values
    (XX_F1, XX_F2, ..., where XX is the feature name).

--smote
    0: do not apply SMOTE; 1: apply SMOTE (Synthetic Minority
    Oversampling Technique) if the training dataset is imbalanced.

--mode
    mode=0: only generate features and do feature selection.
    mode=1: do feature extraction and selection, then do classification.
    With mode 0, users can simply input FASTA sequences to obtain
    sequence features; with mode 1, they can also generate features and
    run six default machine learning classifiers:
    KNN: K-Nearest Neighbors
    LR: Logistic Regression
    GNB: Gaussian Naive Bayes
    SVM: Support Vector Machine
    RF: Random Forest
    LGBM: Light Gradient Boosting Machine

--numclass
    The default value is 2 for binary classification; if the number of
    classes is not 2, specify the exact value. Both binary and
    multiclass classification are supported.

--kfold
    The default value is 5. Machine learning models are trained and
    evaluated via k-fold cross-validation.
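The ">sequence_id|label" header convention above can be parsed with a few lines of plain Python. This is an illustrative sketch, not eFeature's internal code; the function name is hypothetical.

```python
# Sketch (not part of eFeature) of parsing the ">sequence_id|label"
# FASTA convention described above.
def parse_labeled_fasta(text):
    """Return a list of (sequence_id, label, sequence) tuples."""
    records = []
    seq_id, label, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, label, "".join(chunks)))
            seq_id, label = line[1:].split("|", 1)  # e.g. "seq01|1"
            label = int(label)
            chunks = []
        else:
            chunks.append(line)  # sequences may span multiple lines
    if seq_id is not None:
        records.append((seq_id, label, "".join(chunks)))
    return records

example = """>seq01|1
MATQAAATTYVVV
>seq02|0
MKTAYIAKQR"""
print(parse_labeled_fasta(example))
```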


3. Usage Examples

(1) training dataset name: train_data.fasta
(2) test dataset name: test_data.fasta
(3) output file name: subloc.csv

eg.1 Convert the training data to eFeature values

 python eFeature.py --inTrain train_data.fasta --out subloc.csv

eg.2 Convert the training and independent testing data to eFeature values

 python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv

eg.3 Convert the training and independent testing data to eFeature values and apply SMOTE to the training dataset if it is imbalanced

 python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv --smote 1

eg.4 For an imbalanced binary classification problem, use:

  python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv --smote 1 --mode 1

eg.5 Use the --kfold parameter to set up the k-fold cross-validation:

 python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv --smote 1 --mode 1 --kfold 10

eg.6 For a multiclass (ten classes, for example) classification problem, use:

 python eFeature.py --inTrain train_data.fasta --inTest test_data.fasta --out subloc.csv --mode 1 --numclass 10

4. Feature types extracted by the eFeature toolkit:

Abbreviation   Pretrained model
lm             Language Model
BiLSTM         Bidirectional Long Short-Term Memory
SSA            Soft Sequence Alignment
BERT           BERT-based model
UniRep         mLSTM
W2V            word2vec
FusedAll       All of the above features concatenated into one long vector
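The FusedAll row describes concatenating the six per-feature tables into one long vector per sequence. A minimal sketch of that fusion, joining per-feature rows on the sequence id, could look as follows; the function and variable names are illustrative, not eFeature's internals.

```python
# Hypothetical sketch of building a FusedAll-style long vector: each
# table has header [id, label, XX_F1, ...]; tables are joined on the
# sequence id and their feature columns are concatenated.
def fuse_feature_tables(tables):
    """tables: list of row-lists, each with header [id, label, feats...]."""
    fused = {}
    header = ["id", "label"]
    for rows in tables:
        head, body = rows[0], rows[1:]
        header.extend(head[2:])              # append this table's feature names
        for row in body:
            rec = fused.setdefault(row[0], [row[0], row[1]])
            rec.extend(row[2:])              # concatenate feature values
    return [header] + list(fused.values())

bert = [["id", "label", "BERT_F1"], ["seq01", "1", "0.2"]]
w2v  = [["id", "label", "W2V_F1"],  ["seq01", "1", "0.7"]]
print(fuse_feature_tables([bert, w2v]))
```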

5. Output data:

All output feature values and machine learning results are stored in the out folder under the current directory.

Example run:

python eFeature.py --inTrain SubLocTrain.fasta --inTest SubLocTest.fasta --out subloc.csv --mode 1 --smote 1 --numclass 10 --kfold 5

5.1 The out folder contains the following folders

5.2 The SSA folder contains the following folders and files

5.3 The machine learning results are stored in the Results folder

Folder or file name        Meaning
ValANDTest                 k-fold validation and independent testing results
ValANDTest_SelectedFeatures  k-fold validation and independent testing results after feature selection preprocessing
SMOTE_ValANDTest           SMOTE is applied first, then the k-fold validation and independent testing results are obtained
SMOTE_lgbmSF_ValANDTest    SMOTE is applied first, followed by light gradient boosting feature selection, then the k-fold validation and independent testing results are obtained

5.4 Take the ValANDTest folder as an example

Folder or file name               Meaning
Validation_MeanResults            Mean values of the k-fold validation metrics
Test_MeanResults                  Mean values of the independent testing metrics
EachFold_Validaiton_Test_Results  Validation and independent testing results for each k-fold-validated machine learning model
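The mean-result files above come from k-fold cross-validation: each fold is held out once for validation, a model is fit on the remaining folds, and the per-fold metrics are averaged. A plain-Python sketch of that scheme (assumed for illustration, not eFeature's code, with a toy majority-class "model"):

```python
import random
from statistics import mean

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def kfold_mean_accuracy(X, y, fit, predict, k=5):
    folds = kfold_indices(len(X), k)
    accs = []
    for i, val in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [j for m, f in enumerate(folds) if m != i for j in f]
        model = fit([X[j] for j in train], [y[j] for j in train])
        accs.append(mean(predict(model, X[j]) == y[j] for j in val))
    return mean(accs)  # this is the "mean result" over the k folds

# Toy usage: the "model" just predicts the majority training class.
X = list(range(10)); y = [0] * 6 + [1] * 4
fit = lambda Xs, ys: max(set(ys), key=ys.count)
predict = lambda model, x: model
print(kfold_mean_accuracy(X, y, fit, predict, k=5))
```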

6. More

6.1 UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction)

Once the eFeature values are generated, UMAP is run automatically and the results are saved to CSV files whose names contain "UMAP".

Run the script: python plot.py to obtain all UMAP data plots in the plotting folder.

6.2 lgbmSF (light gradient boosting machine feature selection)
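In lgbmSF-style selection, features are ranked by the importances of a trained gradient boosting model (e.g. LightGBM's feature_importances_) and the least informative ones are dropped. The cutoff rule below (keep features whose importance exceeds the mean importance) is an assumption for illustration, not eFeature's documented criterion.

```python
from statistics import mean

def select_by_importance(feature_names, importances):
    """Keep features whose boosting importance exceeds the mean.
    (Assumed cutoff rule; importances would come from a fitted
    LightGBM model via model.feature_importances_.)"""
    cutoff = mean(importances)
    return [n for n, imp in zip(feature_names, importances) if imp > cutoff]

names = ["BERT_F1", "BERT_F2", "W2V_F1", "W2V_F2"]
imps  = [120, 3, 45, 0]   # e.g. split gains from a fitted booster
print(select_by_importance(names, imps))  # -> ['BERT_F1', 'W2V_F1']
```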

6.3 SMOTE (Synthetic Minority Oversampling Technique)
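SMOTE balances an imbalanced training set by synthesizing new minority-class samples as interpolations between existing minority points. The toolkit installs imbalanced-learn for this; the minimal sketch below only illustrates the core interpolation idea (a production SMOTE interpolates toward k-nearest minority neighbors, which this sketch omits).

```python
import random
from collections import Counter

def smote_oversample(X, y, minority_label, seed=0):
    """Oversample the minority class up to the majority-class count by
    interpolating between pairs of minority samples. Simplified sketch:
    real SMOTE picks neighbors among the k nearest, not at random."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(counts.values()) - counts[minority_label]
    X_new, y_new = list(X), list(y)
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # interpolation factor in [0, 1)
        X_new.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
        y_new.append(minority_label)
    return X_new, y_new

X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0], [11.0, 11.0]]
y = [0, 0, 0, 1, 1]
Xr, yr = smote_oversample(X, y, minority_label=1)
print(Counter(yr))  # classes are now balanced
```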

References:

[1] Zhibin Lv, Feifei Cui, Quan Zou, Lichao Zhang and Lei Xu. Anticancer peptides prediction with deep representation learning features. Briefings in Bioinformatics, 2021. doi:10.1093/bib/bbab008

[2] Zhibin Lv, Pingping Wang, Quan Zou, Qinghua Jiang. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa1074


contact: lvzhibin at pku.edu.cn