Bioinformatics experiments using LibID

1. classifying microRNA precursors from pseudo hairpins

Ab initio classification of microRNA precursors is important for detecting new microRNA genes. In Xue’s work, there are 193 positive samples and 8494 negative samples. The negative samples far outnumber the positive ones because a genome contains many more hairpins than microRNA precursors. Their work therefore combines under-sampling with libSVM: they selected 163 positive samples and 168 negative samples as the training set, and used 30 positive samples and 1000 negative samples as the testing set. The following table compares libID with triplet-SVM (their software program).

measure	triplet-SVM	libID
Sensitivity (sn)	0.93	0.83
Specificity (sp)	0.88	0.91

We consider more negative information, so the sp of libID is better than that of triplet-SVM. The sn of triplet-SVM is high because of overfitting, which is also mentioned in Xue’s paper. When applied to other species, its sn falls below 0.9 and is usually around 0.8.

Furthermore, in this problem, as in some other bioinformatics problems, sp is more important than sn. Low sn means that some new genes are missed, whereas low sp produces false candidates that waste the cost of experimental verification. Biology researchers therefore usually prefer high sp to high sn.
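The trade-off can be made concrete with a small calculation. The sketch below uses the standard definitions sn = TP / (TP + FN) and sp = TN / (TN + FP). The function names are illustrative and are not part of libID or triplet-SVM.

```python
def sensitivity(tp, fn):
    """Fraction of real positives that the classifier recovers."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real negatives that the classifier rejects."""
    return tn / (tn + fp)


def expected_false_positives(sp, n_negatives):
    """Number of negatives wrongly reported as positive at a given sp."""
    return round((1.0 - sp) * n_negatives)


# Testing set size from the text: 1000 negative samples.
for name, sp in [("triplet-SVM", 0.88), ("libID", 0.91)]:
    print(name, expected_false_positives(sp, 1000))
```

On the 1000 negative test samples, an sp of 0.88 corresponds to about 120 false candidates and an sp of 0.91 to about 90. Each false candidate would cost an experiment to rule out.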

Reference

C. Xue, F. Li, T. He, G. Liu, Y. Li, X. Zhang. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics. 2005, 6:310.

2. searching SNPs from mismatches in the alignment of EST sequences

Discovering SNPs is challenging because of the cost of manual experiments. Using computational methods to mine SNPs from ESTs can reduce the cost in time and money. However, identifying the real SNPs among the SNP candidates mined from EST sequences is an imbalanced-data problem.

Jun Wang (unpublished) found 3074 SNP candidates in 22994 human EST sequences, of which 183 are positive samples and 2891 are negative samples.

Because feature extraction is difficult, common classifiers achieve only weak classification on this problem. Here libID, which is based on ensemble learning, performs better than any single classifier. The following table compares libID with libSVM. Two strategies are adopted to help libSVM deal with the imbalanced data, but libID achieves higher sn and sp than both.

measure	libSVM (under-sampling)	libSVM (ensemble and vote)	libID
sn	0.50	0.66	0.81
sp	0.69	0.70	0.82

The experiment suggests that libID performs better on this weak classification problem.
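The text does not describe how the two libSVM strategies were set up. The sketch below shows one common reading, under stated assumptions: under-sampling keeps a random subset of negatives equal in size to the positive set, and the ensemble splits the negatives into near-equal subsets, trains one classifier per subset together with all positives, and takes a majority vote. All function names are hypothetical, and the classifier training itself is left out.

```python
import random
from collections import Counter


def undersample(negatives, n_keep, seed=0):
    """Randomly keep n_keep negatives (assumed: uniform, without replacement)."""
    return random.Random(seed).sample(negatives, n_keep)


def split_negatives(negatives, n_parts, seed=0):
    """Shuffle the negatives and deal them into n_parts near-equal subsets."""
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def majority_vote(labels):
    """Return the most frequent label among the ensemble members' predictions."""
    return Counter(labels).most_common(1)[0][0]


# Stand-in sample ids with the counts reported above (183 positive, 2891 negative).
positives = list(range(183))
negatives = list(range(183, 183 + 2891))

# Strategy 1: a single balanced training set.
balanced_negatives = undersample(negatives, len(positives))

# Strategy 2: 15 subsets of about 193 negatives, each paired with all positives
# to train one member; an odd member count avoids tied votes on two labels.
negative_parts = split_negatives(negatives, 15)
training_sets = [(positives, part) for part in negative_parts]
```

The second strategy uses every negative sample once, whereas plain under-sampling discards most of them. That may be one reason why the voting variant scores higher in the table.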

3. identifying snoRNAs from ncRNAs

Small nucleolar RNAs (snoRNAs), which fall into two major categories, box C/D and box H/ACA snoRNAs, are a group of non-coding RNAs that direct site-specific modifications of rRNAs and snRNAs. SnoReport (Hertel et al., 2008) distinguishes snoRNAs from pseudo-ones based on several features, such as minimum free energy (MFE). Hertel used libSVM on her training sets (H/ACA box, C/D box) and obtained the performance below. LibID works better on these imbalanced training sets.

RNA	measure	libSVM	libID
H/ACA box snoRNA	sn	0.78	0.86
H/ACA box snoRNA	sp	0.89	0.90
C/D box snoRNA	sn	0.96	0.90
C/D box snoRNA	sp	0.91	0.94

The table shows that libID works better than libSVM on H/ACA box snoRNAs. On C/D box snoRNAs, libID has lower sn but higher sp, which is more important than sn for bioinformatics software.

Reference

Jana Hertel, Ivo L. Hofacker, Peter F. Stadler. SnoReport: Computational identification of snoRNAs with unknown targets. Bioinformatics. 2008, 24(2):158-164.

4. ···

More applications in bioinformatics will be added in the future. If you use libID and get good performance, please tell me; I will acknowledge your work if you can provide your data. If you find that libID performs poorly on your data, please do not hesitate to email me, as it will help me improve this software.

Dr. Quan Zou

2008-11-27