Bioinformatics experiments using LibID
1. classifying microRNA precursors from pseudo hairpins
Ab initio classification of microRNA precursors is important for detecting new microRNA genes. Xue’s work uses 193 positive samples and 8494 negative samples. Negative samples far outnumber positive ones because a genome contains many more hairpins than microRNA precursors. Their work therefore combines under-sampling with libSVM: they selected 163 positive samples and 168 negative samples as the training set, and used 30 positive samples and 1000 negative samples as the test set. The following table compares libID with triplet-SVM (their software program).
| | triplet-SVM | libID |
| Sensitivity (sn) | 0.93 | 0.83 |
| Specificity (sp) | 0.88 | 0.91 |
We consider more negative information, so the sp of libID is better than that of triplet-SVM. The sn of triplet-SVM is high because of overfitting, which is also mentioned in Xue’s paper. When applied to other species, its sn is lower than 0.9 and usually around 0.8.
Furthermore, in this problem, as in some other bioinformatics problems, sp is more important than sn. Low sn only means that some new genes are missed, whereas low sp produces false positives that waste the cost of experimental validation. Biology researchers therefore usually prefer high sp over high sn.
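To make the trade-off concrete, the short sketch below computes sn and sp from confusion-matrix counts and converts an sp value into the number of false positives expected on the 1000-negative test set described above. The function names are hypothetical helpers written for this illustration; they are not part of libID or triplet-SVM.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sn, sp) from confusion-matrix counts."""
    sn = tp / (tp + fn)  # fraction of real positives that are recovered
    sp = tn / (tn + fp)  # fraction of real negatives that are rejected
    return sn, sp


def expected_false_positives(sp, n_negatives):
    """Number of negatives wrongly reported as positive at a given sp."""
    return round((1 - sp) * n_negatives)


# Test-set size and sp values are taken from the text and table above.
N_NEGATIVES = 1000
for name, sp in [("triplet-SVM", 0.88), ("libID", 0.91)]:
    print(name, expected_false_positives(sp, N_NEGATIVES))
```

On this test set, sp = 0.88 corresponds to 120 false candidates to validate, and sp = 0.91 to 90.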
Reference
C. Xue, F. Li, T. He, G. Liu, Y. Li, X. Zhang, Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine, BMC Bioinformatics 2005, 6:310.
2. searching SNPs from mismatches in the alignment of EST sequences
Discovering SNPs is challenging because of the cost of manual experiments. Using computational methods to mine SNPs from ESTs can reduce the cost in time and money. However, identifying real SNPs among the SNP candidates mined from EST sequences is an imbalanced-data problem.
Jun Wang (unpublished) found 3074 SNP candidates in 22994 human EST sequences, of which 183 are positive samples and 2891 are negative samples.
Because informative features are hard to extract, this is a weak classification problem for common classifiers. Here libID, which is based on ensemble learning, performs better than any single classifier. The following table compares libID with libSVM, for which two strategies were adopted to handle the imbalanced data. libID is better on both sn and sp.
| | libSVM (under-sampling) | libSVM (ensemble and vote) | libID |
| sn | 0.50 | 0.66 | 0.81 |
| sp | 0.69 | 0.70 | 0.82 |
The experiment suggests that libID performs better on this weak classification problem.
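The two baseline strategies named in the table, under-sampling and ensemble-and-vote, can be sketched roughly as follows. This is only an illustration of the general idea, not the procedure used in the experiment. A nearest-centroid classifier stands in for libSVM so that the example runs with the standard library alone. The function names, the toy data, and the number of models are all assumptions.

```python
import random


def train_centroid(samples):
    """Stand-in base learner (NOT libSVM): store the mean vector of each class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: dist(model[label]))


def undersample(positives, negatives, rng):
    """Keep all positives and draw an equally sized random subset of negatives."""
    return positives + rng.sample(negatives, len(positives))


def train_ensemble(positives, negatives, n_models=5, seed=0):
    """Train one base model per balanced subset (under-sampling repeated n_models times)."""
    rng = random.Random(seed)
    return [train_centroid(undersample(positives, negatives, rng)) for _ in range(n_models)]


def vote(models, features):
    """Majority vote over the ensemble; label 1 wins only with more than half the votes."""
    votes = sum(predict_centroid(m, features) for m in models)
    return 1 if votes * 2 > len(models) else 0


# Toy imbalanced data (hypothetical): 10 positives around (2, 2), 200 negatives around (0, 0).
rng = random.Random(1)
positives = [([rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)], 1) for _ in range(10)]
negatives = [([rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)], 0) for _ in range(200)]
models = train_ensemble(positives, negatives)
print(vote(models, [2.0, 2.0]), vote(models, [0.0, 0.0]))
```

A single call to `undersample` followed by one base model corresponds to the plain under-sampling column. Repeating it and voting uses more of the negative samples, which is one plausible reason the ensemble column scores higher.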
3. identifying snoRNAs from ncRNAs
Small nucleolar RNAs (snoRNAs) are a group of non-coding RNAs that direct site-specific modifications of rRNAs and snRNAs. They fall into two major categories, box C/D and box H/ACA snoRNAs. SnoReport (Hertel et al., 2008) distinguishes snoRNAs from pseudo-snoRNAs using several features, such as the minimum free energy (MFE). Hertel et al. applied libSVM to their training sets (H/ACA box and C/D box) and obtained the performance below. libID works better on these imbalanced training sets.
| RNA | measure | libSVM | libID |
| H/ACA box snoRNA | sn | 0.78 | 0.86 |
| | sp | 0.89 | 0.90 |
| C/D box snoRNA | sn | 0.96 | 0.90 |
| | sp | 0.91 | 0.94 |
The table shows that libID works better than libSVM on H/ACA box snoRNAs. On C/D box snoRNAs its sn is lower but its sp is better, and sp is more important than sn for bioinformatics software.
Reference
J. Hertel, I. L. Hofacker, P. F. Stadler, SnoReport: computational identification of snoRNAs with unknown targets, Bioinformatics 2008, 24(2):158-164.
4. ···
More applications in bioinformatics will be added in the future. If you use libID and get good performance, please tell me; I will acknowledge your result if you can provide your data. If you find that libID performs poorly on your data, please do not hesitate to email me, as it will help me improve this software.