Classification based on DNA sequence


Balanced data

Group Type Literature Year Data Acc URL
Function

Unequal length sequence
DNA Enhancers 2021 Download ↓

Training dataset:
742 strong enhancers
742 weak enhancers
1484 non-enhancers
Testing dataset:
100 strong enhancers
100 weak enhancers
200 non-enhancers
5-CV
layerI: 76.18%
layerII: 62.53%
Independent testing
layerI: 79.75%
layerII: 85.00%
iEnhancer-RF
2021 10-CV
layerI: 81.10%
layerII: 66.74%
Independent testing
layerI: 75.75%
layerII: 63.50%
iEnhancer-XG
2018 jackknife test
layerI: 78.03%
layerII: 65.03%
Independent testing
layerI: 74.75%
layerII: 61.00%
iEnhancer-EL
2016 Download ↓

Training dataset:
742 strong enhancers
742 weak enhancers
1484 non-enhancers
jackknife test
layerI: 77.39%; layerII: 68.19%
EnhancerPred
2015 jackknife test
layerI: 76.89%; layerII: 61.93%
iEnhancer-2L
Promoters 2019 Download ↓

Training dataset:
promoter sequences: 3382
non-promoters: 3382
5-CV
84.06%
iPSW(2L)-PseKNC
Nucleosome 2018 Download ↓

Training dataset:
S1: C. elegans
2567 nucleosome-forming
2608 nucleosome-inhibiting
S2: D. melanogaster
2900 nucleosome-forming
2850 nucleosome-inhibiting
jackknife test
S1: 92.29%
S2: 88.26%
NucPosPred
DNA sequence 2021 Large Data
Download ↓

Training dataset: 4400000 sequences;
Validation dataset: 8000 sequences;
Test dataset: 455024 sequences
AVAUROC: 0.94519
AVAUPR: 0.39522
DeepATT
Modification

Equal length sequence
N4-methylcytosine
(4mC)
2021 Download ↓

F. vesca:
Training dataset: P-3457/N-3457;
Testing dataset: P-864/N-864, 4320, 12960 (P/N = 1:1, 1:5, 1:15);
R. chinensis:
Training dataset: P-1938/N-1938;
Testing dataset: P-483/N-483, 2415, 7245 (P/N = 1:1, 1:5, 1:15);
5-CV
F.vesca: 0.8697
R.chinensis: 0.8541
Independent testing
F.vesca:
P/N=1:1 0.8632; P/N=1:5 0.8632; P/N=1:15 0.8412
R.chinensis:
P/N=1:1 0.8490; P/N=1:5 0.8400; P/N=1:15 0.8477
4mC-w2vec
2020 Download ↓

E.coli:
Training dataset: P-388 /N-388
Testing dataset: P-134 /N-134
10-CV
Training set: 85.4%
Independent testing
83.2%
EC4mC-SVM
2020 Large Data
Download ↓

A. thaliana: P-20000/ N-20000;
C. elegans: P-20000/ N-20000;
D. melanogaster: P-20000/ N-20000;
10-CV
A. thaliana: 84.4%;
C. elegans: 89.3%;
D. melanogaster: 87.1%
Deep4mcPred
2019 Download ↓

DatasetI:
C. elegans: P-1554 /N-1554;
D. melanogaster: P-1769 /N-1769;
A. thaliana: P-1978 /N-1978;
E. coli: P-388/N-388;
G. subterraneus: P-905/N-905;
G. pickeringii: P-569/N-569;
Cross_validation
C. elegans: 0.880; D. melanogaster: 0.874; A. thaliana: 0.825; E. coli: 0.894; G. subterraneus: 0.886; G. pickeringii: 0.907
4mcPred-IFL
2019 10-CV
C.elegans: 0.815; D.melanogaster: 0.830; A.thaliana: 0.787; E.coli: 0.833; G.subterraneus: 0.837; G.pickeringii: 0.860
4mcPred-SVM
2019 jackknife test
C.elegans: 87.71%; D.melanogaster: 87.79%; A.thaliana: 83.37%; E.coli: 94.97%; G.subterraneus: 91.04%; G.pickeringii: 90.89%
Independent testing
C.elegans: 82.21%; D.melanogaster: 82.63%; A.thaliana: 76.52%; E.coli: 82.69%; G.subterraneus: 83.33%; G.pickeringii: 77.63%
4mCPred
2019 Download ↓

Training dataset: DatasetI
Testing dataset: C. elegans: P-750/ N-750;
D. melanogaster: P-1000/ N-1000;
A. thaliana: P-1250/ N-1250;
E. coli: P-134/N-134;
G. subterraneus: P-350/N-350;
G. pickeringii: P-200/ N-200;
10-CV
C.elegans: 0.826;
D.melanogaster: 0.842;
A.thaliana: 0.792;
E.coli: 0.848;
G.subterraneus: 0.855;
G.pickeringii: 0.891
Independent testing
C.elegans: 0.870;
D.melanogaster: 0.906;
A.thaliana: 0.855;
E.coli: 0.825;
G.subterraneus: 0.850;
G.pickeringii: 0.850
Meta-4mCpred
2019 Download ↓

mouse genome
Training dataset: P-800/ N-800
Testing dataset: P-180/ N-180
10-CV
Training dataset: 0.795±0.001
Independent testing
0.798±0.011
4mCpred-EL
N6-methyladenine
(6mA)
2022 Large Data
Download ↓

Benchmark datasets
Rice-Chen: P-880/ N-880;
Rice-Lv: P-154000/ N-154000;
Imbalanced datasets
Rice-Chen(1:5): P-176/ N-880;
Rice-Chen(1:10): P-88/ N-880;
Rice-Chen(1:20): P-44/ N-880;
Rice-Lv(1:5): P-30800/ N-154000;
Rice-Lv(1:10): P-15400/ N-154000;
Rice-Lv(1:20): P-7700/ N-154000;
Independent datasets
NIP_10000: P-10000/ N-10000;
A.thaliana: P-15937/ N-15937;
D.melanogaster: P-11191/ N-11191;
R.chinensis: P-11815/ N-11815;
10-CV
Rice-Chen: 97.00%
Rice-Lv: 96.00%
Independent testing
NIP_10000: 88.00%;
A.thaliana: 77.00%;
D.melanogaster: 82.00%;
R.chinensis: 86.00%;
MGF6mARice
2021 Large Data
Download ↓

Rice: P-154000/ N-154000;
A.thaliana: P-98483/ N-98483;
F.vesca: P-1417/ N-1417;
R.chinensis: P-5733/ N-5733;
5-CV
Rice: ACC 94.01%;
A.thaliana: AUC 0.954;
F.vesca: AUC 0.982;
R.chinensis: AUC 0.961;
Deep6mA
2021 Large Data
Download ↓

Rice Genome
Dataset-I: P-154000/ N-154000
Dataset-II: P-880/ N-880
5-CV
Dataset-I: 93.82%
Independent testing
Dataset-II: 96.19%
iRicem6A-CNN
2020 10-CV
Dataset-II: 87.27%
Independent testing
Dataset-I: 85.65%
6mA-RicePred
2019 Download ↓

cross-species: P-2768/ N-2716;
Rice: P-880/ N-880;
M. musculus: P-1934/ N-1934;
5-CV
cross-species: 0.824;
Rice: 0.875;
M.musculus: 0.969
iIM-CNN
2019 Download ↓

Cross-species
Training dataset: P-2214/ N-2214;
Testing dataset: P-554/ N-502;
Rice: P-880/ N-880;
M. musculus: P-1934/ N-1934;
5-CV
Cross-species
Training dataset: 0.799;
Rice: 0.861;
M.musculus: 0.966
Independent testing
Cross-species: 0.813;
csDMA

Note: 5-CV: 5-fold cross-validation; 10-CV: 10-fold cross-validation; P: Positive samples; N: Negative samples;
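
As a rough illustration of the validation schemes named in the note, the sketch below builds k-fold splits over sample indices. Setting k equal to the number of samples gives a leave-one-out split, which is what the jackknife test entries refer to. The function names `k_fold_indices` and `accuracy` are hypothetical and are not taken from any of the listed tools.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (the Acc column)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For the 2968 sequences of the enhancer training set (742 + 742 + 1484), `k_fold_indices(2968, 5)` would give test folds of 594, 594, 594, 593 and 593 sequences.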



Imbalanced data

Group Type Literature Year Data Imbalance Algorithms Acc URL
Function

Unequal length sequence
Promoters 2021 Download ↓

promoter sequences: σ24: 484; σ28: 134; σ32: 291; σ38: 163; σ54: 94; σ70: 1694
non-promoter sequences: 2860
SMOTE 5-CV
layerI: 90.05%
5-CV
layerII: σ24: 97.75%; σ28: 99.84%; σ32: 98.66%; σ38: 99.06%; σ54: 99.94%; σ70: 94.19%
iPro2L-PSTKNC
2018 IHTS 5-CV
layerI: 81.68%
5-CV
layerII: σ24: 93.50%; σ28: 96.85%; σ32: 94.41%; σ38: 94.69%; σ54: 94.04%; σ70: 80.66%
iPromoter-2L
Modification

Equal length sequence
5-methylcytosine
(5mC)


2020 Large Data
Download ↓

Training dataset: P-55800/ N-13950
Testing dataset: P-658861/ N-164715
DSM 5-CV
Training dataset: 90.16%
Independent testing
90.22%
iPromoter-5mC
2015 Download ↓

Methylation samples: 787
Non-methylation samples: 1639
NCR/SMOTE jackknife test
77.49%
iDNA-Methyl

Note: 3-CV: 3-fold cross-validation; 5-CV: 5-fold cross-validation; P: Positive samples; N: Negative samples;
SMOTE: Synthetic Minority Oversampling Technique;
IHTS: Inserting Hypothetical Training Samples;
DSM: Down-Sampling Method;
NCR: Neighborhood Cleaning Rule;
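
To make two of the rebalancing ideas above concrete, here is a minimal sketch of random down-sampling and of SMOTE-style interpolation. It assumes samples are already encoded as numeric feature vectors, uses a simplified neighbour search, and is not the implementation used by the listed tools. The names `down_sample` and `smote_like` are hypothetical.

```python
import random


def down_sample(majority, target_size, seed=0):
    """Randomly keep target_size items from the larger class (no replacement)."""
    if target_size > len(majority):
        raise ValueError("target_size exceeds the number of majority samples")
    return random.Random(seed).sample(majority, target_size)


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic vectors by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Down-sampling discards samples from the larger class, while the interpolation step adds synthetic samples to the smaller one; either way the goal is a training set with a less skewed class ratio.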