Positive Data Set Acquisition
(1) Original Data Set Acquisition: The original positive sample set included 8556 protein sequences downloaded from the website provided by Le et al. Their data were retrieved from UniProt, which has high credibility.
Here
(2) Acquisition of High-Quality Data Set: To ensure accurate experimental results and low or no redundancy among the samples, the data set was processed as follows: (i) protein sequences containing invalid letters such as B, J, O, U, X, and Z were deleted; (ii) protein sequences shorter than 50 residues were deleted; and (iii) redundant sequences were removed with CD-HIT.
Here
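The invalid-letter and length filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used to build the data set: the function names (`parse_fasta`, `keep_sequence`, `filter_records`) are hypothetical, the input is assumed to be in FASTA format, and the CD-HIT step is not reproduced because it relies on an external tool whose settings are not given here.

```python
from typing import Iterable, Iterator, List, Tuple

# Letters treated as invalid in the text above.
INVALID_LETTERS = set("BJOUXZ")
# Sequences shorter than this are discarded.
MIN_LENGTH = 50


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_sequence(seq: str) -> bool:
    """Return True if the sequence is long enough and has no invalid letters."""
    seq = seq.upper()
    return len(seq) >= MIN_LENGTH and not (set(seq) & INVALID_LETTERS)


def filter_records(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the records that pass both filters (CD-HIT is a separate step)."""
    return [(header, seq) for header, seq in records if keep_sequence(seq)]
```

A typical use would be `filter_records(parse_fasta(open(path)))`, where `path` is a placeholder for the downloaded FASTA file; the surviving records would then be passed to CD-HIT for redundancy removal.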
Negative Data Set and Training Set
(1) Acquisition of Negative Examples: Because a large number of negative examples were available, 2000 examples were randomly extracted from the 9630 negative examples, and this extraction was repeated 10 times to guarantee the accuracy of the final classification result. Ten negative-example sets were thus obtained.
(2) Acquisition of Positive Examples: 2000 of the 2678 data items were extracted to form the positive-example set according to the proportions of the five complexes; the quantities of these five complexes were 1379, 66, 113, 255, and 187, respectively, which sum to 2000.
Click here to download the related data
Here
Click here to download the 10 training datasets
Here