Sequence clustering in bioinformatics: An empirical study

Introduction

Sequence clustering is a basic bioinformatics task that is attracting renewed attention with the development of metagenomics and microbiomics. The latest sequencing techniques have lowered costs, and as a result massive amounts of DNA/RNA sequence data are being produced.

We selected several popular clustering tools, briefly explained their key computing principles, analyzed their characteristics, and compared them using two independent benchmark datasets.

Our aim is to assist bioinformatics users in employing suitable clustering tools effectively to analyze big sequencing data.

Datasets for download

We chose two datasets for the redundancy removal experiment and one dataset for testing OTU clustering.

Greengenes is a chimera-checked 16S rRNA gene database, and SWISS-PROT is a protein database. We chose these two datasets to observe how several software tools perform on different types of data (nucleotide and protein sequences).

The third dataset consists of 16S rRNA gene sequences generated as paired-end reads on Illumina's MiSeq platform.

For the redundancy removal experiment

Greengenes (1.7GB)
SWISS-PROT (260MB)
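
To make the redundancy removal task concrete, here is a minimal Python sketch of greedy incremental clustering on toy sequences. It is only an illustration under stated assumptions, not the method of any tool compared in the study: the function names `similarity` and `greedy_dereplicate`, the 0.9 threshold, and the toy sequences are all hypothetical, and `difflib.SequenceMatcher` is used as a crude stand-in for the alignment-based or k-mer-based identity that real tools compute.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (assumption: real tools use
    alignments or k-mer filters, not difflib)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_dereplicate(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Greedy incremental clustering.

    Sequences are visited from longest to shortest. Each one joins the first
    representative it matches at >= threshold; otherwise it becomes a new
    representative. Keeping only the representatives removes redundancy.
    """
    clusters: dict[str, list[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if similarity(seqs[name], seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough.
            clusters[name] = [name]
    return clusters


# Hypothetical toy input; not taken from Greengenes or SWISS-PROT.
toy = {
    "s1": "ACGTACGTACGTACGTACGT",
    "s2": "ACGTACGTACGTACGTACGA",  # differs from s1 at the last base
    "s3": "TTTTTGGGGGCCCCCAAAAA",
}
clusters = greedy_dereplicate(toy, threshold=0.9)
print(clusters)  # {'s1': ['s1', 's2'], 's3': ['s3']}
```

The keys of the returned dictionary are the non-redundant representatives. This sketch compares every sequence against every representative, so it would not scale to files the size of the datasets above; it is meant only to show the idea.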

For OTU clustering

Mothur SOP dataset (884.3MB)

References

Quan Zou, Gang Lin, Xingpeng Jiang, Xiangrong Liu, Xiangxiang Zeng. Sequence clustering in bioinformatics: an empirical study. Briefings in Bioinformatics. doi: 10.1093/bib/bby090
DeSantis T Z, Hugenholtz P, Larsen N, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and environmental microbiology, 2006, 72(7): 5069-5072.
Boeckmann B, Bairoch A, Apweiler R, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic acids research, 2003, 31(1): 365-370.
Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology, 2013, 79(17): 5112-5120.