Title: Towards standardization of the description and publication of next‐generation sequencing datasets of fungal communities
Abstract: New Phytologist, Volume 191, Issue 2, pp. 314-318. Letters. First published: 09 May 2011. https://doi.org/10.1111/j.1469-8137.2011.03755.x

Authors:
R. Henrik Nilsson, Department of Plant and Environmental Sciences, University of Gothenburg, Box 461, 405 30 Göteborg, Sweden; Institute of Ecology and Earth Sciences, University of Tartu, 46 Vanemuise St., 51014 Tartu, Estonia (Author for correspondence: tel +46 31 7862623; email [email protected])
Leho Tedersoo, Institute of Ecology and Earth Sciences, University of Tartu, 46 Vanemuise St., 51014 Tartu, Estonia; Natural History Museum of Tartu University, 46 Vanemuise St., 51014 Tartu, Estonia
Björn D. Lindahl, Department of Forest Mycology and Pathology, Swedish University of Agricultural Sciences, Box 7026, 750 07 Uppsala, Sweden
Rasmus Kjøller, Biological Institute, Terrestrial Ecology, University of Copenhagen, Øster Farimagsgade 2D, DK-1353 Copenhagen, Denmark
Tor Carlsen, Department of Biology, University of Oslo, PO Box 1066 Blindern, N-0316 Oslo, Norway
Christopher Quince, School of Engineering, University of Glasgow, Glasgow G12 8LT, UK
Kessy Abarenkov, Institute of Ecology and Earth Sciences, University of Tartu, 46 Vanemuise St., 51014 Tartu, Estonia
Taina Pennanen, The Finnish Forest Research Institute, PL 18, FI-01301 Vantaa, Finland
Jan Stenlid, Department of Forest Mycology and Pathology, Swedish University of Agricultural Sciences, Box 7026, 750 07 Uppsala, Sweden
Tom Bruns, Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
Karl-Henrik Larsson, The Mycological Herbarium, Natural History Museum, University of Oslo, PO Box 1172, Blindern, N-0318 Oslo, Norway
Urmas Kõljalg, Institute of Ecology and Earth Sciences, University of Tartu, 46 Vanemuise St., 51014 Tartu, Estonia; Natural History Museum of Tartu University, 46 Vanemuise St., 51014 Tartu, Estonia
Håvard Kauserud, Department of Biology, University of Oslo, PO Box 1066 Blindern, N-0316 Oslo, Norway

Fungi play fundamental roles in the nutrient cycling process in most terrestrial ecosystems, notably through forming symbiotic associations such as mycorrhiza with plants and through decomposition of wood and plant debris (Stajich et al., 2009). The fact that fungi spend most of their life cycle below ground or within substrates has left the scientific community with a fragmentary understanding of fungal diversity, and a modest c. 7% of the estimated 1.5 million extant species of fungi have been described (Kirk et al., 2008). The poor correlation between the presence of fungal fruiting bodies or other macroscopic structures and the full diversity of the mycobiome at any sample site has shifted the focus in fungal ecology from fruiting bodies to molecular (DNA sequence) data, and nearly all recent attempts to characterize fungal communities are based on sequence data (Taylor, 2008). Such studies have hitherto been limited in sequence depth by the high cost and investment of effort associated with traditional Sanger sequencing of large numbers of samples, but recent methodological progress in the form of next-generation sequencing (NGS) technologies (Shendure & Ji, 2008) offers a remedy to these problems.
One of these NGS technologies – massively parallel ('454') pyrosequencing (Margulies et al., 2005) – has the capacity to generate more than a million sequences of c. 500 base-pairs (bp) length in the course of a day, making it a groundbreaking tool for environmental sequencing of fungi.

For all the research avenues opened by NGS in general and pyrosequencing in particular, the technologies remain fairly complicated and may, in the absence of generally acknowledged standards, even prove counterproductive to fungal ecology and mycology at large. Various types of incompletely understood biases are introduced at different steps of the analyses (Quince et al., 2009; Bellemain et al., 2010; Tedersoo et al., 2010), and these often receive little attention when the results are interpreted. Approaches to delimitation of species or operational taxonomic units (OTUs) from molecular data differ widely among users, as do ideas on how to handle abundance data, taxonomic standards, and ecological classifications (Hibbett et al., 2011). New web-based software has been developed specifically for sequence processing and analysis of fungal pyrosequencing data – for example, CLOTU (http://www.bioportal.uio.no), SCATA (http://scata.mykopat.slu.se), and PlutoF (http://unite.ut.ee) – and these resources approach sequence clustering and identification in different ways. Other areas where standards are lacking include what data and files to make available with the study in question and what estimates, statistics, and level of detail to report as part of the results. As a consequence, most NGS studies measure and quantify slightly to fundamentally different things and often do so in ways that are neither very clear to the reader nor directly amenable to independent repetition and verification.

Hibbett et al. (2011) provided an overview of the first ten 454-based studies of fungal communities.
While each study represented a significant achievement, differences in the nature and level of detail reported made precise comparison difficult. In more than half of the cases, email exchange with the corresponding authors was necessary for clarification, verification, and access to additional data. If allowed to continue unabated, the heterogeneous, nonstandardized reporting of NGS data on fungal communities will prevent the detailed comparisons of communities and biological processes that NGS technologies were expected to enable. That is not in anyone's interest, and in this letter we introduce a set of core elements that we feel every NGS-based study of fungal communities should report on.

A standard for how environmental NGS data should be generated and analysed is probably not warranted – or even desirable – at this stage, since many aspects of the processing and analysis of NGS data remain tentative and are likely to vary with the scope of the study. Our proposals are instead oriented towards how data and results should be reported and made available to the scientific community (Table 1 and see later). We hope that they will be considered a lower bound on the level of detail necessary, although we recognize that it may not be possible to generate all of them for every NGS dataset.

Table 1. Elements we suggest all next-generation sequencing (NGS) studies of fungal communities should report on in a clear and comprehensive way

Elements to report on | Example(s)

Sequence data: filtering, denoising, and availability
Raw sequence data file | The 454 SFF file and the corresponding unprocessed FASTA and tag/barcode files were deposited in the European Nucleotide Archive (ENA) as ERR00000X
Filtering and trimming | Leading and trailing, but not intercalary, ambiguity symbols were pruned in Flower 0.7 (http://biohaskell.org/Applications/Flower) before analysis. Sequences with more than 2% intercalary ambiguity symbols were discarded. All sequences shorter than 250 base-pairs (bp) after the removal of barcodes, tags, and primers were discarded. All sequences longer than 450 bp were trimmed down to 450 bp from the 3′-end
Sequence denoising | AmpliconNoise 1.2 (Quince et al., 2011) was used to denoise the entries; default settings were used
Number of discarded and retained sequences | Out of a total of 140 000 sequences, 18 234 were found to be of poor read quality, 15 213 too short, and 5430 potentially chimeric, leaving 101 123 (72%) sequences for downstream analyses
Sequencing depth | (A) Of the 101 123 sequences retained, 24 832 represented sample A, 31 323 sample B, 25 118 sample C, and 19 850 sample D. (B) The number of sequences per sample ranged from 4201 to 7912 in the 15 samples
FASTA files | All processed sequences passing the filtering step are available in the FASTA format as Supplementary Item X together with separate FASTA files representing each sample

Sequence data: analysis and taxonomic assignment
Details of the genetic marker used | (A) The full ITS1 of the nuclear ITS region as extracted using Nilsson et al. (2010) was used. (B) The V2 and V3 regions – and their conserved intercalary segment – of the 18S were used (Hartmann et al., 2010). Sequences covering < 75% of the full length of the target region were discarded
Type and specifics of sequence clustering | Complete-link clustering at 98.5% similarity (global alignment) was done in UCLUST 2.1 (Edgar, 2010) with the most abundant sequence types serving as cluster seeds. A list of the reads in each operational taxonomic unit (OTU) is provided in Supplementary Item X
Sequence data used for taxonomic annotation | The most frequent sequence type in each cluster was used for the BLAST searches, and the corresponding FASTA file is provided as Supplementary Item X
Specification of the taxonomic reference database | All fully identified entries in INSD (Benson et al., 2011) and UNITE (Abarenkov et al., 2010) as of December 2010 were used as reference sequences
Specification of the taxonomic annotation procedure | BLAST 2.2.22 (Altschul et al., 1997) was used. ≥ 97% similarity across the entire length of the pairwise alignment was taken to indicate conspecificity; however, ≥ 99% was required in the cases of Cortinarius, Aspergillus and Penicillium. If only one reference sequence was available for some given species, or if the taxonomic annotation was contradicted by another, equally close reference sequence, an asterisk and a question mark, respectively, were added to the taxonomic annotation of the sequence. Greater than or equal to 90% similarity was taken to approximate the genus level (e.g. Hydnum sp.) and ≥ 60% similarity the ordinal level (e.g. Boletales sp.). Only sequences determined at least to ordinal level were used for the phylum-level comparison of Fig. X. Sequences not having a ≥ 60% BLAST match to any reference sequence were considered potentially compromised, were marked as such, and were excluded from the ecological statistics but not from the list of OTUs recovered
Handling of singletons and OTUs with few reads | (A) All singletons were discarded. (B) All OTUs with fewer than 5 reads were excluded from further analysis. (C) Only singletons at least 90% identical to the reference sequence of a species not otherwise recovered in the study were kept

Post-clustering/taxonomic results
Count of OTUs recovered | A total of 242 nonsingleton OTUs were recovered. Forty-two were unique to single samples, and the rest were shared by two or more samples (Supplementary Item X)
List of OTUs recovered | A complete annotated list of the abundance of all fungal OTUs recovered in each sample is provided in the QIIME format (http://qiime.sourceforge.net/documentation/file_formats.html) as Supplementary Item X
Taxonomic affiliations | The OTUs were identified to species, genus, or order as applicable, and the taxonomic affiliations are provided as a separate file (Supplementary Item X)
Proportion of fully identified OTUs | We tentatively identified 72 (30%) of the 242 OTUs to species level. Of the remaining 170 OTUs, we tentatively identified 60 to genus and 88 to order level
Phylum-level distribution | Ninety-seven per cent of the OTUs were of fungal origin. Of these, 58% belonged to Basidiomycota; 22% to Ascomycota; 12% to Glomeromycota; and 8% were found to belong to other fungal lineages

One or more examples (A–C as applicable) are given for each element; they remain examples, however, and should not necessarily be seen as recommendations of methodology or specific software packages. We have no opinion on the exact form in which the elements should be reported in a publication, and we view this item as a checklist rather than as a mandatory table.

Most current NGS studies of fungal communities rely on pyrosequencing, but this is a situation that may change. Whereas our recommendations should be at least conceptually compatible with other existing and emerging NGS technologies such as Illumina sequencing (Bentley, 2006), it is likely that both refinement and adaptation will eventually be needed.

Data availability

Detailed comparisons and meta-analyses of studies are only possible if the underlying data are made available for download. This is not always the case, however, necessitating email exchange with authors regarding files that may no longer exist (Whitlock et al., 2010).
Making the data available through the authors' personal web pages is likewise a makeshift solution that does not meet the criterion of long-term availability to the scientific community (Wren, 2008). We propose that all data relevant to the re-analysis and interpretation of fungal NGS studies should be deposited in central data archives. The European Nucleotide Archive (ENA; Leinonen et al., 2011) is the recommended resource for storage of raw NGS data, including flowgram ('SFF') and unprocessed FASTA files in the case of pyrosequencing datasets. In addition, we propose that all relevant processed and derived files that were used to generate the results of any given NGS study – and that normally cannot be archived in ENA – should be deposited at the publisher's site as supplementary data along with the article presenting the study in question.

Description of sample site and laboratory procedures

For the description of the sample site and sampling conditions, we propose that the MIMARKS/MIxS standard (Yilmaz et al., 2010) should be followed. In so far as the specifics of each sample differ, we argue that full metadata should be given for each sample. We advocate that the laboratory procedures should be described in comprehensive detail – including full primer sequence data, polymerase chain reaction (PCR) enzyme specifics, and other points routinely left out by many authors – and we discourage the practice of referring to other articles instead of providing the corresponding information in writing.

Sequence data: analysis and taxonomic assignment

Fully automated, high-quality species identification from fungal sequence data is presently not possible for any non-trivial assemblage of fungal lineages, leaving caution and taxonomic expertise as important elements in molecular identification of fungi.
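The annotation example in Table 1 applies similarity cutoffs that differ among genera, and a rule of that kind is easy to state unambiguously in a few lines of code. The following is a minimal Python sketch, not a recommended procedure: the function name `annotate`, its arguments, and the `STRICT_GENERA` set are hypothetical, and the cutoffs are simply the illustrative values from the Table 1 example. It assumes that the best reference match, its percent similarity, and its order are already known.

```python
# Genera that the Table 1 example holds to a stricter species-level cutoff.
# (Illustrative values only; not a recommendation.)
STRICT_GENERA = {"Cortinarius", "Aspergillus", "Penicillium"}


def annotate(similarity, species, order):
    """Return a tentative annotation for one OTU from its best reference match.

    similarity: percent similarity across the full pairwise alignment
    species:    binomial of the best-matching reference, e.g. "Hydnum repandum"
    order:      order of that reference, e.g. "Cantharellales"
    """
    genus = species.split()[0]
    # Species-level cutoff depends on the genus of the best match.
    species_cutoff = 99.0 if genus in STRICT_GENERA else 97.0
    if similarity >= species_cutoff:
        return species
    if similarity >= 90.0:
        return f"{genus} sp."
    if similarity >= 60.0:
        return f"{order} sp."
    # No match of at least 60%: flag the sequence rather than name it.
    return "potentially compromised"


if __name__ == "__main__":
    print(annotate(98.0, "Hydnum repandum", "Cantharellales"))
    print(annotate(98.0, "Cortinarius caperatus", "Agaricales"))
```

Reporting the cutoffs in this explicit form, whatever values a study actually uses, lets a reader reproduce the annotation step without contacting the authors.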
In particular, we wish to discourage the use of single, static similarity thresholds for species demarcation as far as possible; these thresholds should be allowed to vary to better account for differences among fungal lineages (e.g. Nilsson et al., 2008; Hughes et al., 2009; Seifert, 2009). Any specific threshold value tailored for, e.g., the internal transcribed spacer (ITS) region, will typically carry over poorly to other genetic markers, such as the ribosomal small subunit. The molecular identification procedure is underspecified in many NGS studies, making independent repetition difficult.

Discretionary consideration

Molecular identification of fungi is fraught with methodological complications, but above all it is severely hampered by the lack of reference sequences for much of the extant diversity of fungi. It would seem appropriate to plan each NGS study of environmental samples so that taxonomic expertise is accounted for among, or available to, the authors of the study. If some part of the budget could also be allocated to generating reference sequences from fruiting bodies relevant for the ecosystem and geographical region under study, then those NGS studies would contribute to the reference sequence databases and ultimately to the possibilities of disentangling the diversity discovered (Brock et al., 2009).

Conclusion

Each NGS study represents a considerable investment in terms of time and money, and for that investment to be of maximum use to the broader scientific community, the data generated and results obtained should be presented and made available in a comprehensive, transparent way. We believe the elements discussed earlier are a significant contribution to the development of such a standard, and their specification is unlikely to take more than a few hours.
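As an illustration of how little effort such reporting can take, the read tally asked for in Table 1 (discarded and retained sequences) can be produced alongside the filtering itself. The Python sketch below is a simplified stand-in for the filtering example in Table 1, not the procedure of any particular tool: the names `filter_read` and `tally` are hypothetical, only `N` is treated as an ambiguity symbol, and denoising and chimera checking are left out.

```python
from collections import Counter

# Illustrative limits taken from the Table 1 example.
MIN_LEN, MAX_LEN, MAX_AMBIG = 250, 450, 0.02


def filter_read(seq):
    """Return (status, read); read is None when the read is discarded."""
    # Prune leading and trailing, but not intercalary, ambiguity symbols.
    seq = seq.upper().strip("N")
    if len(seq) < MIN_LEN:
        return "too short", None
    # Discard reads with more than 2% intercalary ambiguity symbols.
    if seq.count("N") / len(seq) > MAX_AMBIG:
        return "poor quality", None
    # Trim surplus bases from the 3' end.
    return "retained", seq[:MAX_LEN]


def tally(reads):
    """Count reads per category and compute the percentage retained."""
    counts = Counter(filter_read(r)[0] for r in reads)
    total = sum(counts.values())
    retained_pct = 100 * counts["retained"] / total if total else 0.0
    return counts, retained_pct


if __name__ == "__main__":
    demo = ["A" * 300, "A" * 100, "ACGT" * 150]
    counts, pct = tally(demo)
    print(dict(counts), f"{pct:.0f}% retained")
```

Applied to the figures in the Table 1 example, the same bookkeeping gives 140 000 − 18 234 − 15 213 − 5430 = 101 123 retained sequences, or 72% of the total.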
NGS is the most exciting development in fungal ecology for many years, and correctly employed it will enable great strides to be made towards a much deeper understanding of fungi and their trophic roles in ecosystems.

References

Abarenkov K, Nilsson RH, Larsson K-H, Alexander IJ, Eberhardt U, Erland S, Høiland K, Kjøller R, Larsson E, Pennanen T et al. 2010. The UNITE database for molecular identification of fungi – recent updates and future perspectives. New Phytologist 186: 281–285.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25: 3389–3402.
Bellemain E, Carlsen T, Brochmann C, Coissac E, Taberlet P, Kauserud H. 2010. ITS as an environmental DNA barcode for fungi: an in silico approach reveals potential PCR biases. BMC Microbiology 10: 189.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2011. GenBank. Nucleic Acids Research 39: D32–D37.
Bentley DR. 2006. Whole genome re-sequencing. Current Opinion in Genetics and Development 16: 545–552.
Brock PM, Döring H, Bidartondo MI. 2009. How to know unknown fungi: the role of a herbarium. New Phytologist 181: 719–724.
Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26: 2460–2461.
Hartmann M, Howes CG, Abarenkov K, Mohn WW, Nilsson RH. 2010. V-Xtractor: an open-source, high-throughput software tool to identify and extract hypervariable regions of small subunit (16S/18S) ribosomal RNA gene sequences. Journal of Microbiological Methods 83: 250–253.
Hibbett DS, Ohman A, Glotzer D, Nuhn M, Kirk P, Nilsson RH. 2011. Progress in molecular and morphological taxon discovery in Fungi and options for formal classification of environmental sequences. Fungal Biology Reviews 25: 38–47.
Hughes KW, Petersen RH, Lickey EB. 2009. Using heterozygosity to estimate a percentage DNA sequence similarity for environmental species' delimitation across basidiomycete fungi. New Phytologist 182: 795–798.
Kirk PM, Cannon PF, Minter DW, Stalpers JA. 2008. Dictionary of the fungi, 10th edn. Wallingford, UK: CABI Publishing.
Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tárraga A, Cheng Y, Cleland I, Faruque N, Goodgame N, Gibson R et al. 2011. The European Nucleotide Archive. Nucleic Acids Research 39: D28–D31.
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380.
Nilsson RH, Kristiansson E, Ryberg M, Hallenberg N, Larsson K-H. 2008. Intraspecific ITS variability in the kingdom Fungi as expressed in the international sequence databases and its implications for molecular species identification. Evolutionary Bioinformatics 4: 193–201.
Nilsson RH, Veldre V, Hartmann M, Unterseher M, Amend A, Bergsten J, Kristiansson E, Ryberg M, Jumpponen A, Abarenkov K. 2010. An open source software package for automated extraction of ITS1 and ITS2 from fungal ITS sequences for use in high-throughput community assays and molecular ecology. Fungal Ecology 3: 284–287.
Quince C, Lanzén A, Curtis TP, Davenport RJ, Hall N, Head IM, Read LF, Sloan WT. 2009. Accurate determination of microbial diversity from 454 pyrosequencing data. Nature Methods 6: 639–641.
Quince C, Lanzén A, Davenport RJ, Turnbaugh PJ. 2011. Removing noise from pyrosequenced amplicons. BMC Bioinformatics 12: 38.
Seifert KA. 2009. Progress towards DNA barcoding of fungi. Molecular Ecology Resources 9: S83–S89.
Shendure J, Ji H. 2008. Next-generation DNA sequencing. Nature Biotechnology 26: 1135–1145.
Stajich JE, Berbee ML, Blackwell M, Hibbett DS, James TY, Spatafora JW, Taylor JW. 2009. The Fungi. Current Biology 19: R840–R845.
Taylor AFS. 2008. Recent advances in our understanding of fungal ecology. Coolia 52: 197–212.
Tedersoo L, Nilsson RH, Abarenkov K, Jairus T, Sadam A, Saar I, Bahram M, Bechem E, Chuyong G, Kõljalg U. 2010. 454 Pyrosequencing and Sanger sequencing of tropical mycorrhizal fungi provide similar results but reveal substantial methodological biases. New Phytologist 188: 291–301.
Whitlock MC, McPeek MA, Rausher MD, Rieseberg L, Moore AJ. 2010. Data archiving. American Naturalist 175: 145–146.
Wren JD. 2008. URL decay in MEDLINE – a 4-year follow-up study. Bioinformatics 24: 1381–1385.
Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, Gilbert JA, Karsch-Mizrachi I, Johnston A, Cochrane G et al. 2010. The "Minimum Information about an ENvironmental Sequence" (MIENS) specification, version 2. Nature Precedings. doi:10.1038/npre.2010.5252.2