Test datasets
Real DNA/RNA datasets
Ancient human partial mitochondrial DNA
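Sequences of hypervariable region I (HVR-I) and hypervariable region II (HVR-II) from ancient human mitochondrial DNA, each about 350 bp long: 293 HVR-I sequences and 100 HVR-II sequences. The two regions are mixed together in one dataset and must be separated before alignment.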
Ref: Ancient DNA Reveals Key Stages in the Formation of Central European Mitochondrial Genetic Diversity. DOI: 10.1126/science.1241844
Download (zipped file): mt.zip
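Because the HVR-I and HVR-II sequences arrive in one file, they have to be partitioned before alignment. A minimal sketch of that step follows. It assumes the records are already parsed into (header, sequence) pairs, and the rule `by_header_tag` is purely hypothetical: the source does not say how the two regions are labelled in mt.zip, so substitute whatever rule matches the actual headers.

```python
from collections import defaultdict


def split_records(records, classify):
    """Group (header, sequence) pairs by the label classify(header, sequence) returns."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[classify(header, seq)].append((header, seq))
    return dict(groups)


def write_fasta(records, path, width=60):
    """Write (header, sequence) pairs to a FASTA file, wrapping lines at `width`."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")


def by_header_tag(header, seq):
    """Hypothetical rule: assumes each header carries an 'HVR1' or 'HVR2' tag."""
    tag = header.upper()
    if "HVR1" in tag:
        return "HVR1"
    if "HVR2" in tag:
        return "HVR2"
    return "unknown"
```

Each resulting group can then be written to its own file with `write_fasta` and aligned separately. Records that match neither tag land in an "unknown" group so they are not silently assigned to a region.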
mitochondrial genomes
Human mitochondrial genomes: 672 highly similar DNA sequences, maximum length 16,579 bp
Ref: Tanaka M., et al. (2004) Mitochondrial genome variation in eastern Asia and the peopling of Japan. Genome Res,14(10a), 1832-1850.
Download: 1x (219KB) 20x (4.27MB) 50x (10.666MB) 100x (21.325MB)
Human Genome
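Human genome genes
Human Y chromosome genomes
44 sequences in total, each assembled from multiple fragments (for the reference sequence, see CHM13_T2T_v2.0)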
Download: Link
Ref1: Hallast, Pille, et al. “Assembly of 43 human Y chromosomes reveals extensive complexity and variation.” Nature (2023): 1-10.
Ref2: Rhie, Arang, et al. “The complete sequence of a human Y chromosome.” Nature (2023): 1-11.
Three human chromosome 1 sequences
16S rRNA
The 16S rRNA gene is the bacterial DNA sequence that encodes the 16S rRNA. Sequence similarity is low; maximum length 1,586 bp.
Ref: DeSantis, T. Z., et al.(2006) NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res, 34, W394-399.
Download: small (21.864MB) big (197.224MB)
Mycobacterium 23S rRNA sequences
641 Mycobacterium 23S rRNA sequences, with lengths ranging from 1909 to 3485 bp, downloaded from the SILVA rRNA database of Bacteria, Archaea and Eukarya.
Download: 23s_rRNA.tar.xz
COVID-19
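Monkeypox dataset
Complete dataset (fully sequenced genomes, MPoxBR.complete.fasta.xz): high similarity, 1,739 unaligned DNA sequences in FASTA format, lengths from 183,230 to 210,919 bp, mean length 197,135 bp
Partial dataset (partially sequenced genomes, PoxBR.incomplete.fasta.xz): lower similarity, 4,631 unaligned DNA sequences in FASTA format, lengths from 17 to 228,869 bp, mean length 177,921 bp
Last updated: 2023-08-19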
Dataset Mainpage
Dataset Link
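The sequence counts and length ranges quoted for the monkeypox files can be rechecked after download. The sketch below reads an xz-compressed FASTA file with Python's standard `lzma` module and reports the count, minimum, maximum, and mean length. The function names are illustrative, and the path passed to `fasta_xz_stats` is assumed to be a locally downloaded copy such as MPoxBR.complete.fasta.xz.

```python
import lzma


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_stats(records):
    """Return (count, min, max, mean) of the sequence lengths."""
    lengths = [len(seq) for _, seq in records]
    if not lengths:
        raise ValueError("no sequences found")
    return len(lengths), min(lengths), max(lengths), sum(lengths) / len(lengths)


def fasta_xz_stats(path):
    """Compute length statistics for an xz-compressed FASTA file."""
    with lzma.open(path, "rt") as handle:
        return length_stats(read_fasta(handle))
```

Calling `fasta_xz_stats("MPoxBR.complete.fasta.xz")` on a local copy should return figures comparable to those listed, provided the file has not been updated since the date given.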
Neisseria-meningitidis
Streptococcus-pneumoniae
11 sequences, lengths from 2,138,975 to 2,139,054 bp, FASTA format
Download: link
Escherichia-coli
30 sequences, lengths from 5,060,511 to 5,060,749 bp, FASTA format. Because the data are large, the archive is split into multiple volumes.
Download: part 1, part 2
GMGC dataset
Global Microbial Gene Catalog dataset, FASTA format.
Raw data download
Stored in groups by sequence length
Ref: Coelho, L.P., et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022).
HIV dataset
Simulated DNA/RNA datasets
Hierarchical tree simulated datasets
Star-tree simulated datasets
14 simulated datasets, each containing 1000 sequences, with different similarities (99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 70%, and 60%).
Ref: HAlign 3: Fast Multiple Alignment of Ultra-Large Numbers of Similar DNA/RNA Sequences.
Download: star_simudata_1000seq.tar.xz
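To illustrate what a star-tree dataset at a given similarity looks like, here is a minimal sketch in which every sequence is derived independently from one shared center sequence. It is not the procedure used to build star_simudata_1000seq.tar.xz, which the source does not describe. Two assumptions are made: "similarity" is read as per-site identity to the center, and only substitutions are modelled (no insertions or deletions).

```python
import random

ALPHABET = "ACGT"


def mutate(center, similarity, rng):
    """Copy `center`, substituting each site with probability 1 - similarity."""
    out = []
    for base in center:
        if rng.random() < 1.0 - similarity:
            # Always pick a different base so the site really changes.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def star_tree_dataset(n_seqs, length, similarity, seed=0):
    """Return (center, sequences); every sequence descends directly from center."""
    rng = random.Random(seed)
    center = "".join(rng.choice(ALPHABET) for _ in range(length))
    return center, [mutate(center, similarity, rng) for _ in range(n_seqs)]


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Under this model the expected identity between any leaf and the center equals the requested similarity, while two leaves are on average less similar to each other than either is to the center.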
Simulated RNA data
Simulated noncoding data
Protein datasets
MUSCLE dataset
extHomFam dataset
QuanTest2 dataset
Protein dataset with low sequence similarity
Ref: Sievers, F. et al. (2020) QuanTest2: benchmarking multiple sequence alignments using secondary structure prediction, Bioinformatics, 36(1), January 2020, 90–95.
Original version, evaluated via secondary structure prediction: download link
Version scored with Q score and TC score: Dataset Mainpage
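For readers unfamiliar with the two scores, the sketch below computes them under their commonly used definitions: the Q score is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and the TC score is the fraction of reference columns whose residues all fall into a single test column. It assumes both alignments contain the same sequences in the same row order with `-` as the gap character, and it counts only reference columns holding at least two residues. The scoring tool actually used for this dataset may differ in such details.

```python
from itertools import combinations

GAP = "-"


def residue_columns(alignment):
    """For each column, list (row, residue_index) for every non-gap cell."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows must have the same length")
    counters = [0] * len(alignment)
    columns = []
    for col in range(width):
        cells = []
        for row, seq in enumerate(alignment):
            if seq[col] != GAP:
                cells.append((row, counters[row]))
                counters[row] += 1
        columns.append(cells)
    return columns


def aligned_pairs(columns):
    """Set of residue pairs that share a column."""
    pairs = set()
    for cells in columns:
        pairs.update(combinations(cells, 2))
    return pairs


def q_score(reference, test):
    """Fraction of reference residue pairs reproduced in the test alignment."""
    ref_pairs = aligned_pairs(residue_columns(reference))
    test_pairs = aligned_pairs(residue_columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


def tc_score(reference, test):
    """Fraction of reference columns (>= 2 residues) kept together in the test."""
    test_col_of = {cell: c
                   for c, cells in enumerate(residue_columns(test))
                   for cell in cells}
    ref_cols = [cells for cells in residue_columns(reference) if len(cells) >= 2]
    correct = sum(1 for cells in ref_cols
                  if len({test_col_of[cell] for cell in cells}) == 1)
    return correct / len(ref_cols)
```

With the reference `["AC-GT", "ACAGT"]` and the test alignment `["ACG-T", "ACAGT"]`, one of the four reference pairs (and one of the four scored columns) is misaligned, so both functions return 0.75.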