Imbalanced Datasets on Post-Translational Modifications (ImData_PTMs)

  • Lysine Glutarylation
    1. RF-GlutarySite: a random forest based predictor for glutarylation sites (2019)
      Datasets: Training (Positive:400; Negative:1703)---PLMD & Tan's
               Testing (Positive:44; Negative:203)
      Feature extraction/selection: AAIndex & AAfactor & FEBS; XGBoost
      Algorithms: Under-sampling; RF
  • Lysine Succinylation
    1. Inspector: a lysine succinylation predictor based on edited nearest-neighbor undersampling and adaptive synthetic oversampling (2020)
      Datasets: Training (Positive:4755; Negative:50565)---UniProtKB/Swiss-Prot & NCBI
               Testing (Positive:254; Negative:2977)
      Feature extraction/selection: DBPB & PSDAAP & PseAAC & PWAA & EGAAC & CKSAAGP; F-score & IFS
      Algorithms: ENN & ADASYN; RF
    2. SSKM_Succ: A novel succinylation sites prediction method incorporating K-means clustering with a new semi-supervised learning algorithm (2020)
      Datasets: Training (Positive:4695; Unlabeled:47027); Validation (Positive:1815; Unlabeled:24509)---PLMD, UniProt
               Testing (Positive:2608; Unlabeled:1050)---dbPTM
      Feature extraction/selection: proximal PTMs & Grey model & DNC & PSAAP; RF-based selection
      Algorithms: semi-supervised (Clustering & SVM)
    3. Characterization and Identification of Lysine Succinylation Sites based on Deep Learning Method (2019)
      Datasets: Training (Positive:3216; Negative:16408 [4 negative samples are deleted])---PLMD
               Testing (Positive:218; Negative:2621)
      Feature extraction/selection: PSAAC & CKSAAP (top 400 by mRMR) & PSSM
      Algorithms: Maximal Dependence Decomposition (MDD); CNN
    4. Detecting Succinylation sites from protein sequences using ensemble support vector machine (2018)
      Datasets: Training (Positive:4755; Negative:50565)---UniProtKB/Swiss-Prot & NCBI (same as Inspector)
               Testing (Positive:254; Negative:2977)
      Feature extraction/selection: AAC & BE & PCP & GPAAC; IG
      Algorithms: Bootstrap Sampling; SVM & Ensemble method
  • Lysine Glycation
    1. Predicting lysine glycation sites using bi-profile bayes feature extraction (2017)
      Datasets: Training (Positive:223; Negative:446)---CPLM
      Features: BPB
      Algorithms: SVM
  • Histidine Phosphorylation
    1. PROSPECT: A web server for predicting protein histidine phosphorylation sites (2020)
      Datasets: Training (Positive:219; Negative:1277)---UniProt
               Testing (Positive:25; Unlabeled:143)
      Features: one-of-K & EGAAC & CKSAAGP
      Algorithms: Weighted sum of two CNN classifiers and one RF classifier
  • Lysine SUMOylation
    1. SUMO-Forest: A Cascade Forest based method for the prediction of SUMOylation sites on imbalanced data (2020)
      Datasets: Training (Positive:755; Negative:9944)---UniProt
      Features: PSAAP & PseAAP & BK (bi-gram and k-skip-bigram) & SP (statistical properties)
      Algorithms: Cascade Forest
  • Lysine Formylation
    1. Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou's 5-steps rule and general pseudo components (2020)
      Datasets: Training (Positive:182; Negative:1637)---PLMD
      Features: CKSAAP (top 300 by F-score)
      Algorithms: Biased SVM
  • Serine/Threonine O-GlcNAcylation
    1. O-GlcNAcPRED-II: an integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique (2018)
      Datasets: Training (Positive:945; Negative:50914)---dbOGAP & Jochmann's
               Testing (Positive:368; Unlabeled:27139)
      Features: AAC & DAA & BPB & ANBPB & DBPB & PSAAP & PSDAAP & PSTAAP
      Algorithms: KPCA-FUS; Rotation Forest (KNN, RF, NB, SVM)
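
Imbalance ratios and random under-sampling (illustrative sketch)

The training-set sizes listed above imply very different class ratios. The Python sketch below is not taken from any of the listed papers. It computes the number of negatives per positive for the entries that report labelled negatives, and shows plain random under-sampling, which is only the simplest form of the under-sampling step named for RF-GlutarySite. The function names, the `ratio` and `seed` parameters, and the dictionary keys are assumptions made for illustration; the sampling procedures actually used in the papers may differ.

```python
import random

# Training-set sizes (positives, negatives) copied from the entries above.
# Entries that report unlabeled rather than negative samples are left out.
# The keys are informal labels chosen for this sketch.
TRAINING_SETS = {
    "RF-GlutarySite": (400, 1703),
    "Inspector": (4755, 50565),
    "Succinylation CNN (2019)": (3216, 16408),
    "Glycation BPB (2017)": (223, 446),
    "PROSPECT": (219, 1277),
    "SUMO-Forest": (755, 9944),
    "Formylation CKSAAP (2020)": (182, 1637),
    "O-GlcNAcPRED-II": (945, 50914),
}


def imbalance_ratio(n_pos, n_neg):
    """Return the number of negative samples per positive sample."""
    return n_neg / n_pos


def random_undersample(positives, negatives, ratio=1.0, seed=0):
    """Keep every positive and a random subset of the negatives.

    `ratio` is the desired number of negatives per positive after
    sampling; if fewer negatives exist, all of them are kept.
    """
    positives = list(positives)
    negatives = list(negatives)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_keep = min(len(negatives), round(ratio * len(positives)))
    return positives, rng.sample(negatives, n_keep)


# Print the ratios from least to most imbalanced.
for name, (n_pos, n_neg) in sorted(
    TRAINING_SETS.items(), key=lambda item: imbalance_ratio(*item[1])
):
    print(f"{name:28s} 1:{imbalance_ratio(n_pos, n_neg):.1f}")

# Balance a stand-in for the RF-GlutarySite training set; integer IDs
# take the place of the real sequence windows.
pos_ids, neg_ids = random_undersample(range(400), range(1703))
print(len(pos_ids), len(neg_ids))
```

Random under-sampling discards information from the majority class, which is one reason several of the listed methods use more selective schemes such as ENN, ADASYN, or KPCA-FUS.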

Last updated: 2020-08-19
E-mail address: doulijun777@163.com