Positive Data Set Acquisition
(1) Original Data Set Acquisition: The original positive sample set included 8556 protein sequences downloaded from the website provided by Le et al. Their data were retrieved from UniProt, a widely used protein database regarded as highly credible.
(2) Acquisition of High-Quality Data Set: To ensure accurate experimental results and low or no redundancy within the sample set, the data set was processed as follows: (i) protein sequences containing invalid letters such as B, J, O, U, X, and Z were deleted; (ii) protein sequences shorter than 50 residues were deleted; and (iii) redundant sequences were removed with CD-HIT.
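The first two filtering steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the names `INVALID_LETTERS`, `MIN_LENGTH`, `is_valid_sequence`, and `filter_sequences` are hypothetical, sequences are assumed to be held in a dictionary keyed by identifier, and the CD-HIT redundancy step is an external tool that is not reproduced here.

```python
# Hypothetical helper names; the source does not publish its filtering code.
INVALID_LETTERS = set("BJOUXZ")  # letters listed in the text as invalid
MIN_LENGTH = 50                  # sequences shorter than this are dropped


def is_valid_sequence(seq: str) -> bool:
    """Return True if seq is long enough and contains no invalid letter."""
    seq = seq.upper()
    return len(seq) >= MIN_LENGTH and not (set(seq) & INVALID_LETTERS)


def filter_sequences(records: dict) -> dict:
    """Keep only the {identifier: sequence} entries that pass both checks."""
    return {name: seq for name, seq in records.items() if is_valid_sequence(seq)}
```

The surviving sequences would then be passed to CD-HIT for redundancy removal; the identity threshold used for that step is not stated in the text.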
Negative Data Set and Training Set
(1) Acquisition of negative examples: Because negative examples were abundant, 2000 examples were drawn at random from the 9630 negative examples, and this extraction was repeated 10 times to make the final classification result more reliable, yielding 10 negative-example sets.
(2) Acquisition of positive examples: 2000 of the 2678 data items were extracted to form the positive-example set, in proportion to the sizes of the five complexes; the resulting quantities for the five complexes were 1379, 66, 113, 255, and 187 (2000 in total).
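The two sampling procedures can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the fixed seed are hypothetical, each draw is assumed to be without replacement within a set, and the per-complex quotas are taken directly from the quantities reported in the text.

```python
import random

# Per-complex quotas as reported in the text; they sum to 2000.
POSITIVE_QUOTAS = [1379, 66, 113, 255, 187]


def draw_negative_sets(negatives, n_sets=10, set_size=2000, seed=0):
    """Draw n_sets random subsets of set_size items, without repeats inside a set."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    return [rng.sample(negatives, set_size) for _ in range(n_sets)]


def draw_positive_set(groups, quotas=POSITIVE_QUOTAS, seed=0):
    """Sample the given quota from each complex's sequences and concatenate."""
    rng = random.Random(seed)
    chosen = []
    for members, quota in zip(groups, quotas):
        chosen.extend(rng.sample(members, quota))
    return chosen
```

Each of the 10 negative sets can then be paired with the 2000 positive examples to form a balanced training set.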
