Chen Lab @ NCST

iNuc-PhysChem

Nucleosome positioning has important roles in key cellular processes. Although intensive efforts have been made in this area, the rules defining nucleosome positioning are still elusive and debated. In this study, we carried out a systematic comparison of the profiles of twelve DNA physicochemical features between the nucleosomal and linker sequences in the Saccharomyces cerevisiae genome. We found that nucleosomal sequences have some position-specific physicochemical features, which can be used for in-depth study of nucleosomes. Meanwhile, a new predictor, called iNuc-PhysChem, was developed for identification of nucleosomal sequences by incorporating these physicochemical properties into a 1788-D (dimensional) feature vector, which was further reduced to an 884-D vector via the IFS (incremental feature selection) procedure to optimize the feature set. A cross-validation test on a benchmark dataset showed that the overall success rate achieved by iNuc-PhysChem was over 96% in identifying nucleosomal or linker sequences.
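
The IFS step mentioned above can be pictured with a small sketch. This is not the authors' implementation: the ranked feature list and the `toy_accuracy` scoring function are hypothetical stand-ins for the feature ranking and the cross-validated accuracy used in the study. The idea is simply to add ranked features one at a time and keep the prefix that scores best.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[int],
    evaluate: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Grow the feature set one ranked feature at a time; keep the best prefix."""
    best_subset: List[int] = []
    best_score = float("-inf")
    current: List[int] = []
    for feature in ranked_features:
        current.append(feature)
        score = evaluate(current)  # e.g. cross-validated accuracy on this subset
        if score > best_score:
            best_score = score
            best_subset = list(current)
    return best_subset, best_score


def toy_accuracy(subset: List[int]) -> float:
    # Hypothetical scoring function: accuracy peaks when three features are used.
    return 1.0 - abs(len(subset) - 3) * 0.1


# Feature indices are assumed to be sorted from most to least informative.
selected, score = incremental_feature_selection([7, 2, 9, 4, 1], toy_accuracy)
```

In practice `evaluate` would train and test a classifier on the chosen columns, so each call is expensive; the loop itself stays the same.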

[Website]

iRSpot-PseDNC

Meiotic recombination is an important biological process. As a main driving force of evolution, recombination naturally provides new combinations of genetic variations. Rather than occurring randomly across a genome, meiotic recombination takes place in some genomic regions (the so-called ‘hotspots’) with higher frequencies, and in other regions (the so-called ‘coldspots’) with lower frequencies. Therefore, information about the hotspots and coldspots would provide useful insights for in-depth study of the mechanism of recombination and of the genome evolution process as well. So far, the recombination regions have been mainly determined by experiments, which are both expensive and time-consuming. With the avalanche of genome sequences generated in the postgenomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying the recombination regions. In this study, a predictor, called ‘iRSpot-PseDNC’, was developed for identifying the recombination hotspots and coldspots. In the new predictor, the samples of DNA sequences are formulated by a novel feature vector, the so-called ‘pseudo dinucleotide composition’ (PseDNC), into which six local DNA structural properties, i.e. three angular parameters (twist, tilt and roll) and three translational parameters (shift, slide and rise), are incorporated. The rigorous jackknife test showed that the overall success rate achieved by iRSpot-PseDNC was >82% in identifying recombination spots in Saccharomyces cerevisiae.
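
A PseDNC-style vector can be sketched as follows, assuming the usual formulation: 16 normalized dinucleotide frequencies followed by `lam` sequence-order correlation factors weighted by `w`. The function name, the default values of `lam` and `w`, and above all the property table are assumptions made for illustration; the placeholder numbers below are not the real twist, tilt, roll, shift, slide and rise values, which would normally be standardized before use.

```python
from itertools import product
from typing import Dict, List

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def pse_dnc(seq: str, props: Dict[str, List[float]], lam: int = 3, w: float = 0.05) -> List[float]:
    """Return a (16 + lam)-dimensional PseDNC-style vector for an A/C/G/T sequence."""
    seq = seq.upper()
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if lam >= len(dinucs):
        raise ValueError("lam must be smaller than the number of dinucleotides")

    # Normalized occurrence frequency of each of the 16 dinucleotides.
    freqs = [dinucs.count(d) / len(dinucs) for d in DINUCLEOTIDES]

    # Tier-j correlation factor: mean squared property difference between
    # dinucleotides that lie j steps apart along the sequence.
    thetas = []
    for j in range(1, lam + 1):
        diffs = [
            sum((a - b) ** 2 for a, b in zip(props[d1], props[d2])) / len(props[d1])
            for d1, d2 in zip(dinucs[:-j], dinucs[j:])
        ]
        thetas.append(sum(diffs) / len(diffs))

    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Placeholder property table (two made-up properties per dinucleotide).
toy_props = {d: [float(i % 4), float(i // 4)] for i, d in enumerate(DINUCLEOTIDES)}
vec = pse_dnc("ACGTACGTAGCTAGCT", toy_props, lam=3, w=0.05)
```

The first 16 components capture local composition and the last `lam` components capture longer-range order; by construction the whole vector sums to one.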

[Website]

iPro54-PseKNC

A promoter is a region of DNA that determines the transcription of a particular gene. In prokaryotes, it is the sigma factors of the RNA polymerase holoenzyme that recognize and bind to the promoter sequences during gene transcription. The sigma-54 promoters are unique in prokaryotic genomes and are responsible for transcribing carbon- and nitrogen-related genes. With the avalanche of genome sequences generated in the postgenomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying the sigma-54 promoters. Here, a predictor called ‘iPro54-PseKNC’ was developed. In the predictor, the samples of DNA sequences were formulated by a novel feature vector called ‘pseudo k-tuple nucleotide composition’, which was further optimized by the incremental feature selection procedure. The performance of iPro54-PseKNC was examined by rigorous jackknife cross-validation tests on a stringent benchmark data set.
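
The jackknife test referred to here is leave-one-out cross-validation: each sample is held out in turn and predicted by a model built from all the others. The sketch below illustrates the procedure only; the nearest-neighbour classifier and the toy data are hypothetical stand-ins, not the classifier or benchmark used for iPro54-PseKNC.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


def nearest_neighbor_predict(train: List[Sample], x: Sequence[float]) -> int:
    """Stand-in classifier: return the label of the closest training sample."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(train, key=lambda s: sq_dist(s[0], x))[1]


def jackknife_accuracy(
    data: List[Sample],
    predict: Callable[[List[Sample], Sequence[float]], int] = nearest_neighbor_predict,
) -> float:
    """Hold out each sample once, predict it from the rest, and report accuracy."""
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every sample except the i-th
        if predict(train, x) == y:
            correct += 1
    return correct / len(data)


# Toy two-class data set (made up for illustration).
toy = [
    ((0.0, 0.1), 0), ((0.1, 0.0), 0), ((0.2, 0.1), 0),
    ((1.0, 0.9), 1), ((0.9, 1.0), 1), ((1.1, 1.0), 1),
]
acc = jackknife_accuracy(toy)
```

Because every sample is tested exactly once and the split is fixed, the result for a given data set is deterministic, which is why the test is often described as rigorous.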

[Website]

iORI-PseKNC

The initiation of replication at origins is an extremely important process of DNA replication. The distribution of replication origin regions (ORIs) is the major determinant of the timing of genome replication. Thus, correctly identifying ORIs is crucial to understanding the DNA replication mechanism. With the avalanche of genome sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly, effectively and automatically identifying the ORIs in genomes. In this paper, we developed a predictor called iORI-PseKNC for identifying ORIs in the Saccharomyces cerevisiae genome. In the predictor, based on the concept of the global and long-range sequence-order effects of DNA sequence, the feature called “pseudo k-tuple nucleotide composition” (PseKNC) was used to encode the DNA sequences by incorporating six local structural properties of 16 dinucleotides. An overall success rate of 83.72% was achieved in the jackknife cross-validation test on an objective benchmark dataset.

[Website]

iDNA4mC

DNA N4-methylcytosine (4mC) is an epigenetic modification. Knowledge about the distribution of 4mC is helpful for understanding its biological functions. Although experimental methods have been proposed to detect 4mC sites, they are expensive for genome-wide detection. Thus, it is necessary to develop computational methods for predicting 4mC sites. We developed iDNA4mC, the first webserver to identify 4mC sites, in which DNA sequences are encoded with both nucleotide chemical properties and nucleotide frequency. The results of the rigorous jackknife test and the cross-species test demonstrated that the performance of iDNA4mC is quite promising and that it holds high potential to become a useful tool for identifying 4mC sites.
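
One way to combine nucleotide chemical properties with nucleotide frequency is sketched below. It assumes the coding scheme commonly used for this kind of encoding, where each base gets three binary flags (ring structure, functional group, hydrogen-bond strength) plus a running density of that base up to the current position. The function name and the exact table are assumptions for illustration and are not taken from the iDNA4mC source code.

```python
from typing import Dict, List, Tuple

# Assumed flags: (purine?, amino group?, weak hydrogen bonding?)
CHEMICAL_CODE: Dict[str, Tuple[int, int, int]] = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_ncp_nd(seq: str) -> List[Tuple[int, int, int, float]]:
    """Encode each base as three chemical flags plus its cumulative density."""
    counts: Dict[str, int] = {}
    encoded = []
    for pos, base in enumerate(seq.upper(), start=1):
        counts[base] = counts.get(base, 0) + 1
        density = counts[base] / pos  # share of this base among the first `pos` bases
        encoded.append((*CHEMICAL_CODE[base], density))
    return encoded


rows = encode_ncp_nd("AAC")
```

A sequence of length L becomes an L x 4 table, which can be flattened into a single feature vector for a classifier. Bases other than A, C, G and T raise a `KeyError` in this sketch.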

[Website]

PseKNC and PseKNC-General

PseKNC-General (the general form of pseudo k-tuple nucleotide composition) allows for fast and accurate computation of all the widely used nucleotide structural and physicochemical properties of both DNA and RNA sequences. PseKNC-General can generate several modes of pseudo nucleotide compositions, including conventional k-tuple nucleotide compositions, Moreau–Broto autocorrelation coefficient, Moran autocorrelation coefficient, Geary autocorrelation coefficient, Type I PseKNC and Type II PseKNC. In every mode, >100 physicochemical properties are available to choose from. Moreover, it is flexible enough to allow users to calculate PseKNC with user-defined properties. The package can be run on Linux, Mac and Windows systems and also provides a graphical user interface.
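
The simplest of these modes, the conventional k-tuple nucleotide composition, can be illustrated independently of the package. The sketch below does not use or reproduce the PseKNC-General interface; the function name and parameters are hypothetical.

```python
from itertools import product
from typing import Dict


def ktuple_composition(seq: str, k: int = 2, alphabet: str = "ACGT") -> Dict[str, float]:
    """Normalized frequency of every k-tuple over the alphabet (4**k entries for DNA)."""
    seq = seq.upper()
    total = len(seq) - k + 1  # number of overlapping windows of length k
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing other symbols (e.g. N) are skipped
            counts[kmer] += 1
    return {kmer: c / total for kmer, c in counts.items()}


comp = ktuple_composition("ACGT", k=2)
```

For RNA, passing `alphabet="ACGU"` gives the corresponding composition. The pseudo components of Type I and Type II PseKNC are then appended to a vector like this one.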

[Website for PseKNC] [Website for PseKNC-General]

iLoc-LncRNA

Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 bases. They have important functions in cell development and metabolism, including genetic markers, genome rearrangements, chromatin modifications, cell cycle regulation, transcription and translation, etc. Generally, their functions are also closely related to their location in a cell. The aberrant expression of lncRNAs is associated with several types of cancer, Alzheimer's disease, etc. The study of their localization could provide preliminary insight into their cellular functions. In this work, we designed a sequence-based predictor called “iLoc-LncRNA” to predict the subcellular locations of lncRNAs. In the predictor, a high-quality benchmark dataset including four locations was constructed based on the RNALocate database. The key octonucleotide (8-tuple) features were incorporated into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach to formulate lncRNA samples. The support vector machine (SVM) was adopted to perform discrimination. In rigorous 5-fold cross-validations, we achieved a maximum overall accuracy of 86.72%, suggesting that the proposed predictor is promising and will provide important guidance in this area.
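
The binomial distribution approach for picking key k-tuples can be sketched as below, assuming the formulation commonly used for this kind of feature ranking: for a given k-tuple, compute the probability of seeing at least its observed count in one class by chance, and use one minus that probability as a confidence level. The function name, argument names and numbers are illustrative assumptions, not values from the paper.

```python
from math import comb


def binomial_confidence(n_in_class: int, n_total: int, class_fraction: float) -> float:
    """Confidence level CL = 1 - P(X >= n_in_class), with X ~ Binomial(n_total, class_fraction).

    n_in_class     : occurrences of the k-tuple in the class of interest
    n_total        : occurrences of the k-tuple across all classes
    class_fraction : share of all k-tuple occurrences that fall in this class
    """
    p_tail = sum(
        comb(n_total, m) * class_fraction ** m * (1 - class_fraction) ** (n_total - m)
        for m in range(n_in_class, n_total + 1)
    )
    return 1.0 - p_tail


# A k-tuple seen 2 times overall, both in a class holding half of all occurrences.
cl = binomial_confidence(2, 2, 0.5)
```

A high confidence level means the k-tuple is over-represented in that class beyond what chance would suggest, so features can be ranked by this value before being fed to the classifier.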

[Website]

iRNA-PseU

As the most abundant RNA modification, pseudouridine plays important roles in many biological processes. Occurring at the uridine site and catalyzed by pseudouridine synthase, the modification has been observed in nearly all kinds of RNA, including transfer RNA, messenger RNA, small nuclear or nucleolar RNA, and ribosomal RNA. Accordingly, its importance to basic research and drug development is self-evident. Although some experimental technologies have been developed to detect pseudouridine sites, they are both time-consuming and expensive. Facing the explosive growth of RNA sequences in the postgenomic age, we are challenged to address the problem by computational approaches: for an uncharacterized RNA sequence, can we predict which of its uridine sites can be modified to pseudouridine and which ones cannot? Here a predictor called “iRNA-PseU” was proposed by incorporating the chemical properties of nucleotides and their occurrence frequency density distributions into the general form of pseudo nucleotide composition (PseKNC). It has been demonstrated via the rigorous jackknife test, independent dataset test, and practical genome-wide analysis that the proposed predictor remarkably outperforms its counterpart.

[Website]

iRNA-PseColl

There are many different types of RNA modifications, which are essential for numerous biological processes. Knowing where modifications occur in an RNA sequence is key to an in-depth understanding of their biological functions and mechanisms. Unfortunately, it is both time-consuming and laborious to determine these sites by experiments alone. Although some computational methods have been developed in this regard, each can handle only a single type of modification. To our knowledge, no method has thus far been developed that can identify the occurrence sites for several different types of RNA modifications with one seamless package or platform. To address this challenge, a novel platform called “iRNA-PseColl” has been developed. It was formed by incorporating both the individual and collective features of the sequence elements into the general pseudo K-tuple nucleotide composition (PseKNC) of RNA via the physicochemical properties and density distribution of its constituent nucleotides. Rigorous cross-validations have indicated that the anticipated success rates achieved by the proposed platform are quite high.

[Website]

iRNA-Methyl

Occurring at adenine (A) with the consensus motif GAC, N6-methyladenosine (m6A) is one of the most abundant modifications in RNA and plays very important roles in many biological processes. The nonuniform distribution of m6A sites across the genome implies that, to better understand the regulatory mechanism of m6A, it is indispensable to characterize its sites on a genome-wide scale. Although a series of experimental technologies have been developed in this regard, they are both time-consuming and expensive. With the avalanche of RNA sequences generated in the postgenomic age, it is highly desirable to develop computational methods for the timely identification of their m6A sites. In view of this, a predictor called “iRNA-Methyl” is proposed by formulating RNA sequences with the “pseudo dinucleotide composition” into which three RNA physicochemical properties were incorporated. Rigorous cross-validation tests have indicated that iRNA-Methyl holds very high potential to become a useful tool for genome analysis.

[Website]

iHSP-PseRAAAC

Heat shock proteins (HSPs) are a type of functionally related proteins present in all living organisms, both prokaryotes and eukaryotes. They play essential roles in protein–protein interactions such as folding and assisting in the establishment of proper protein conformation and prevention of unwanted protein aggregation. Their dysfunction may cause various life-threatening disorders, such as Parkinson’s, Alzheimer’s, and cardiovascular diseases. Based on their functions, HSPs are usually classified into six families: (i) HSP20 or sHSP, (ii) HSP40 or J-class proteins, (iii) HSP60 or GroEL/ES, (iv) HSP70, (v) HSP90, and (vi) HSP100. Although considerable progress has been achieved in discriminating HSPs from other proteins, it is still a big challenge to identify HSPs among their six different functional types according to their sequence information alone. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop a high-throughput computational tool in this regard. To take up such a challenge, a predictor called iHSP-PseRAAAC has been developed by incorporating the reduced amino acid alphabet information into the general form of pseudo amino acid composition. One of the remarkable advantages of introducing the reduced amino acid alphabet is being able to avoid the notorious dimension disaster or overfitting problem in statistical prediction. The overall success rate achieved by iHSP-PseRAAAC in identifying the functional types of HSPs among the aforementioned six types was more than 87%, as derived by the jackknife test on a stringent benchmark dataset in which none of the HSPs included has ≥40% pairwise sequence identity to any other in the same subset. It has not escaped our notice that the reduced amino acid alphabet approach can also be used to investigate other protein classification problems.
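
The effect of a reduced amino acid alphabet can be shown with a short sketch. The six-cluster grouping below is a hypothetical example chosen only for illustration and is not the alphabet used by iHSP-PseRAAAC. The point is that composition features are computed over a handful of clusters instead of 20 residues, which shrinks the feature space.

```python
from typing import Dict, List

# Hypothetical grouping of the 20 standard residues into six clusters.
CLUSTERS: List[str] = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
RESIDUE_TO_CLUSTER: Dict[str, int] = {
    aa: idx for idx, group in enumerate(CLUSTERS) for aa in group
}


def reduced_composition(protein: str) -> List[float]:
    """Fraction of residues falling in each cluster; unknown symbols are ignored."""
    counts = [0] * len(CLUSTERS)
    valid = 0
    for aa in protein.upper():
        idx = RESIDUE_TO_CLUSTER.get(aa)
        if idx is not None:
            counts[idx] += 1
            valid += 1
    if valid == 0:
        raise ValueError("no standard amino acids found")
    return [c / valid for c in counts]


comp = reduced_composition("AKDG")
```

With six clusters, a dipeptide composition needs 36 features instead of 400, which is the kind of dimension reduction that helps limit overfitting on small benchmark sets.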

[Website]