Latest release of HAlign 2: HAlign2.1.jar, source code
Latest release of HAlign 3: HAlign2.1.jar, source code
We are now actively developing a new version. We strongly recommend that you try this version if you want to align nucleic acid sequences.
Detailed usage commands are shown in the documentation section. In addition, a user-friendly server is now available.
Development environments:
# java -jar HAlign2.1.jar <mode> <input-file> <output-file> <algorithm>
<mode>: -localMSA, -localTree.
<algorithm>: one of 0, 1, 2, 3, or 4 for -localMSA mode, but none for -localTree mode.
  0 represents the suffix tree algorithm, the fastest, but only for DNA/RNA;
  1 represents the KBand algorithm based on the BLOSUM62 scoring matrix, only for Protein;
  2 represents the KBand algorithm based on affine gap penalty, only for DNA/RNA;
  3 represents the trie tree alignment algorithm, slower and only for DNA/RNA;
  4 represents the basic algorithm based on the similarity matrix, the slowest and only for DNA/RNA, but the most accurate when sequence similarity is low.
# hadoop jar HAlign2.1.jar <mode> <input-file> <output-file> <algorithm>
<mode>: -hadoopMSA.
# spark-submit --class main HAlign2.1.jar <mode> <input-file> <output-file> <algorithm>
<mode>: -sparkMSA, -sparkTree.
<algorithm>: 0 or 1 for -sparkMSA mode, but none for -sparkTree mode.
  0 represents the suffix tree algorithm, the fastest, but only for DNA/RNA;
  1 represents the KBand algorithm based on the BLOSUM62 scoring matrix, only for Protein.
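The mode and algorithm rules above are easy to get wrong on the command line, so the following is a minimal sketch of a wrapper that checks a combination before launching anything. It is not part of HAlign: the function name `build_command`, the `seq_type` labels, and the table layout are assumptions made for illustration. Only the local and Spark modes are covered, because the usage text above does not list the algorithm values accepted by -hadoopMSA.

```python
from typing import Dict, List, Optional, Set

DNA_RNA: Set[str] = {"DNA", "RNA"}

# Launcher prefix per mode, copied from the usage lines above.
LAUNCHERS: Dict[str, List[str]] = {
    "-localMSA": ["java", "-jar", "HAlign2.1.jar"],
    "-localTree": ["java", "-jar", "HAlign2.1.jar"],
    "-sparkMSA": ["spark-submit", "--class", "main", "HAlign2.1.jar"],
    "-sparkTree": ["spark-submit", "--class", "main", "HAlign2.1.jar"],
}

# Algorithm code -> sequence types it accepts, per MSA mode.
# Tree modes are absent here because they take no <algorithm> argument.
ALGORITHMS: Dict[str, Dict[int, Set[str]]] = {
    "-localMSA": {0: DNA_RNA, 1: {"Protein"}, 2: DNA_RNA, 3: DNA_RNA, 4: DNA_RNA},
    "-sparkMSA": {0: DNA_RNA, 1: {"Protein"}},
}


def build_command(mode: str, input_file: str, output_file: str,
                  seq_type: Optional[str] = None,
                  algorithm: Optional[int] = None) -> List[str]:
    """Return the argument list for one run, or raise ValueError."""
    if mode not in LAUNCHERS:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    cmd = LAUNCHERS[mode] + [mode, input_file, output_file]

    allowed = ALGORITHMS.get(mode)
    if allowed is None:
        # Tree modes: no algorithm may be given.
        if algorithm is not None:
            raise ValueError(f"{mode} takes no <algorithm> argument")
        return cmd

    if algorithm not in allowed:
        raise ValueError(f"algorithm {algorithm} is not valid for {mode}")
    if seq_type not in allowed[algorithm]:
        raise ValueError(f"algorithm {algorithm} does not accept {seq_type}")
    return cmd + [str(algorithm)]
```

The returned list can be passed to `subprocess.run` as is. File names such as `input.fasta` are placeholders chosen by the caller.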
Fig. 1. Workflow of Trie Tree
As shown in the figure above, the trie tree approach first finds the center star sequence, then sums up the results into the final result. The trie tree version runs faster than the K-Band version. However, the K-Band version considers the affine gap penalty, while the trie tree version does not. The K-Band approach can be applied to DNA, RNA, and protein multiple sequence alignment, which gives it a key role in multiple sequence alignment. More detailed information is given in the cited paper.
Fig. 2. Workflow of Hadoop Map Reduce
Based on the MapReduce technique of Hadoop, our program can run in Hadoop mode to minimize time cost. The program reads the input file from the local file system, transforms it into a file of key-value pair lines, and saves it in the Hadoop Distributed File System (HDFS). Next, the pairwise sequence alignment procedure is distributed and computed in parallel. After this step, the program searches for the unique center star sequence, and then the next step begins. After the summing-up procedure, the reduce method generates the final result. More detailed information is given in the cited paper.
Copyright@2016 by Bioinformatics Laboratory, Tianjin University.
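
To make the Hadoop workflow described above more concrete, here is a minimal single-process sketch of its first stages: flattening FASTA records into key-value lines, a map step over sequence pairs, and a reduce step that picks the center star sequence. This is an illustration under stated assumptions, not HAlign's implementation. The tab-separated line format, the function names, and the use of plain edit distance in place of HAlign's pairwise alignment are all assumptions, and the final merge of pairwise alignments is not shown.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, Iterator, List, Tuple


def to_key_value_lines(fasta_text: str) -> List[str]:
    """Flatten FASTA records into 'name<TAB>sequence' lines (assumed format)."""
    lines: List[str] = []
    name, chunks = None, []
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                lines.append(f"{name}\t{''.join(chunks)}")
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        lines.append(f"{name}\t{''.join(chunks)}")
    return lines


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as a stand-in pairwise score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def map_pairs(records: List[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (name, distance) for both members of every pair."""
    for (n1, s1), (n2, s2) in combinations(records, 2):
        d = edit_distance(s1, s2)
        yield n1, d
        yield n2, d


def reduce_center(pairs: Iterable[Tuple[str, int]]) -> str:
    """Reduce step: sum distances per name and return the smallest total.

    Ties are broken by name so the result is unique. Needs at least two
    input sequences, otherwise there are no pairs to reduce.
    """
    totals = defaultdict(int)
    for name, d in pairs:
        totals[name] += d
    return min(totals, key=lambda n: (totals[n], n))
```

A caller would split each key-value line on the tab, pass the resulting `(name, sequence)` tuples to `map_pairs`, and feed its output to `reduce_center`. In a real Hadoop job the map calls would run on different nodes, and the pair enumeration would be partitioned rather than done in one loop.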