Recently, two de Bruijn-type assemblers, MetaVelvet and Meta-IDBA [ 42 ], have been released that deal explicitly with the non-clonality of natural populations. Both assemblers aim to identify within the entire de Bruijn graph a sub-graph that represents related genomes. Alternatively, the metagenomic sequence mix can be partitioned into "species bins" via k-mer binning (Titus Brown, personal communication). Those sub-graphs or subsets are then resolved to build a consensus sequence of the genomes.
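To illustrate the k-mer decomposition that de Bruijn-type assemblers build on, the following Python sketch constructs a toy de Bruijn graph from a handful of reads. The reads and the small value of k are made-up examples, and real assemblers add coverage tracking, error correction and graph simplification on top of this basic structure.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a simple de Bruijn graph: nodes are (k-1)-mers,
    and each k-mer contributes an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy example with two short, overlapping reads and a deliberately small k
reads = ["ATGGCGTGCA", "GGCGTGCAAT"]
graph = de_bruijn_graph(reads, k=5)
for node, successors in sorted(graph.items()):
    print(node, "->", ", ".join(sorted(successors)))
```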
For Meta-IDBA, an improvement in terms of N50 and maximum contig length has been observed when compared to "classical" de Bruijn assemblers (e.g., Velvet or SOAP; results from the personal experience of the authors; data not shown here). The development of "metagenomic assemblers" is, however, still at an early stage, and it is difficult to assess their accuracy for real metagenomic data, as typically no references exist to compare the results to.
A true gold standard (i.e., a metagenomic dataset for which the underlying genome content is fully known) is currently lacking. Several factors need to be considered when exploring the reasons for assembling metagenomic data; these can be condensed into two important questions. First, what is the length of the sequencing reads used to generate the metagenomic dataset, and are longer sequences required for annotation?
Some approaches can annotate short, unassembled reads directly. On the whole, however, the longer the sequence information, the better the ability to obtain accurate information. One obvious impact is on annotation: the longer the sequence, the more information it provides, making it easier to compare with known genetic data (e.g., through homology searches against reference databases). Annotation issues will be discussed in the next section.
Binning and classification of DNA fragments for phylogenetic or taxonomic assignment also benefit from long, contiguous sequences, and certain tools (e.g., PhyloPythia) work reliably only above a specific length cut-off. Second, is the dataset assembled to reduce data-processing requirements? Here, as an alternative to assembling reads into contigs, clustering near-identical reads with CD-HIT [ 45 ] or UCLUST [ 46 ] will provide clear benefits in data reduction. Fundamentally, assembly is also motivated by the fact that single reads generally have lower quality, and hence lower confidence in accuracy, than multiple reads that cover the same segment of genetic information.
Therefore, merging reads increases the quality of the information. Obviously, in a complex community with low sequencing depth or coverage, it is unlikely that many reads will cover the same fragment of DNA.
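To illustrate this point, a back-of-the-envelope estimate of expected per-genome coverage (reads attributable to an organism, times read length, divided by its genome size) shows how quickly sequencing depth drops for rare community members. The run parameters and the community composition below are purely hypothetical.

```python
def expected_coverage(total_reads, read_len, genome_size, rel_abundance):
    """Expected sequencing depth of one community member:
    reads attributable to it * read length / genome size."""
    return total_reads * rel_abundance * read_len / genome_size

# Hypothetical run: 10 million 100 bp reads, a 4 Mbp genome at 0.1% relative abundance
print(expected_coverage(10_000_000, 100, 4_000_000, 0.001))  # 0.25, i.e. ~0.25x coverage
```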
Hence assembly may be of limited value for metagenomics. Unfortunately, without assembly, longer and more complex genetic elements (e.g., repeat regions) are difficult to analyze. Hence there is a need for metagenomic assembly to obtain high-confidence contigs that enable the study of, for example, major repeat classes. However, none of the current assembly tools is bias-free. Several strategies have been proposed to increase assembly accuracy [ 38 ], but strategies such as removal of rare k-mers are no longer considered adequate, since rare k-mers do not represent sequence errors, as initially assumed, but instead represent reads from less abundant pan-genomes in the metagenomic mix.
Binning refers to the process of sorting DNA sequences into groups that might represent an individual genome or genomes from closely related organisms. Several algorithms have been developed, which employ two types of information contained within a given DNA sequence. Firstly, compositional binning makes use of the fact that genomes have a conserved nucleotide composition (e.g., a characteristic GC content or k-mer distribution), which is also reflected in fragments of those genomes. Secondly, the unknown DNA fragment might encode a gene, and the similarity of this gene to known genes in a reference database can be used to classify and hence bin the sequence.
There is also a number of binning algorithms that consider both composition and similarity, including the programs PhymmBL [ 55 ] and MetaCluster [ 56 ]. All these tools employ different methods of grouping sequences, including self-organising maps (SOMs) or hierarchical clustering, and are operated either in an unsupervised manner or with input from the user (supervised) to define bins. Important considerations for using any binning algorithm are the type of input data available and the existence of suitable training datasets or reference genomes.
In general, composition-based binning is not reliable for short reads, as they do not contain enough information. For example, a short read can at best contain fewer than half of the 256 possible 4-mers, and this is not sufficient to determine a 4-mer distribution that will reliably relate this read to any other read.
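As a concrete illustration of composition-based binning, the following sketch computes normalized tetranucleotide (4-mer) frequency profiles for a set of fragments and groups them with k-means clustering. The use of scikit-learn's KMeans and the choice of three bins are arbitrary assumptions for illustration; published binning tools use considerably more sophisticated models.

```python
from itertools import product
import numpy as np
from sklearn.cluster import KMeans

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 possible 4-mers
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def tetra_freq(seq):
    """Return a normalized 256-dimensional 4-mer frequency vector for one fragment."""
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in INDEX:          # skip 4-mers containing N or other ambiguity codes
            counts[INDEX[kmer]] += 1
    total = counts.sum()
    return counts / total if total else counts

def bin_fragments(fragments, n_bins=3):
    """fragments: list of (id, sequence) tuples, e.g. from assembled contigs.
    Returns a dict mapping fragment id to a bin label."""
    profiles = np.array([tetra_freq(seq) for _, seq in fragments])
    labels = KMeans(n_clusters=n_bins, n_init=10, random_state=0).fit_predict(profiles)
    return {frag_id: int(label) for (frag_id, _), label in zip(fragments, labels)}
```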
Compositional assignment can, however, be improved if suitable training datasets (e.g., longer fragments of known taxonomic origin) are available. These "training" fragments can either be derived from assembled data or from sequenced fosmids and should ideally contain a phylogenetic marker, such as an rRNA gene, that can be used for high-resolution taxonomic assignment of the binned fragments [ 57 ].
Short reads may show similarity to a known gene, and this information can be used to putatively assign the read to a specific taxon. This taxonomic assignment obviously requires the availability of reference data. If the query sequence is only distantly related to known reference genomes, only a taxonomic assignment at a very high level (e.g., phylum) might be possible. If the metagenomic dataset, however, contains two or more genomes that would fall into this high-level taxon assignment, then "chimeric" bins might be produced.
In this case, the two genomes might be separated by additional binning based on compositional features. In general, however, this might again require that the unknown fragments have a certain length. Binning algorithms will obviously benefit in the future from the availability of a greater number and phylogenetic breadth of reference genomes, in particular for similarity-based assignment to low taxonomic levels. Post-assembly, the binning of contigs can lead to the generation of partial genomes of yet-uncultured or unknown organisms, which in turn can be used to perform similarity-based binning of other metagenomic datasets.
Caution should, however, be taken to ensure the validity of any newly created genome bin, as "contaminating" fragments can rapidly propagate into false assignments in subsequent binning efforts. Prior to assembly with "clonal" assemblers, binning can be used to reduce the complexity of an assembly effort and might reduce computational requirements.
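A common way to keep similarity-based assignment conservative, as described above, is a lowest-common-ancestor (LCA) strategy: a read is placed at the lowest taxonomic rank on which all of its database hits agree. The Python sketch below illustrates the idea; the lineage lists are hypothetical inputs that would in practice come from a homology search against a taxonomically annotated reference database.

```python
def lca_assign(hit_lineages):
    """Assign a read to the lowest rank shared by all hit lineages.

    Each lineage is a list ordered from domain towards species.
    Returns the shared prefix, which may stop at a high rank (e.g. phylum)
    when the hits disagree at lower ranks; an empty list means unassigned."""
    if not hit_lineages:
        return []
    consensus = []
    for ranks in zip(*hit_lineages):
        if len(set(ranks)) == 1:
            consensus.append(ranks[0])
        else:
            break
    return consensus

# Example: two hits that agree only down to the phylum level
lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
print(lca_assign(lineages))  # ['Bacteria', 'Proteobacteria']
```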
For the annotation of metagenomes two different initial pathways can be taken. First, if reconstructed genomes are the objective of the study and assembly has produced large contigs, it is preferable to use existing pipelines for genome annotation, such as RAST [ 58 ] or IMG [ 59 ].
For this approach to be successful, minimal contig lengths of roughly 30 kbp or longer are required. Second, annotation can be performed on the entire community and relies on unassembled reads or short contigs. Here, the tools for genome annotation are significantly less useful than those specifically developed for metagenomic analyses.
Annotation of metagenomic sequence data has, in general, two steps. First, features of interest (genes) are identified (feature prediction) and, second, putative gene functions and taxonomic neighbors are assigned (functional annotation).
Feature prediction is the process of labeling sequences as genes or genomic elements. Several tools have been developed specifically for gene prediction in metagenomic data. All of these tools use internal information (e.g., codon usage) to classify sequence stretches as coding or non-coding, and none of them will identify every gene. These missing genes can potentially be identified by BLAST-based searches; however, the size of current metagenomic datasets often makes this computationally expensive step prohibitive.
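As a toy counterpart to such feature-prediction tools, the sketch below scans the forward strand of a sequence for simple open reading frames (start codon to in-frame stop above a minimum length). Real metagenomic gene callers use much richer signals (codon usage, GC-frame bias, models for fragmentary genes), so this only illustrates the basic principle.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=90):
    """Yield (start, end, frame) for naive forward-strand ORFs of at least min_len bp."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # remember the first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    yield (start, i + 3, frame)
                start = None                   # reset and keep scanning this frame

# Toy sequence; min_len is lowered so the single short ORF is reported
for orf in find_orfs("CCATGAAACCCGGGTTTGCGGCTTAGGG", min_len=15):
    print(orf)
```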
There is also a number of tools for the prediction of non-protein-coding genes and other genomic features, such as tRNAs [ 66 , 67 ], signal peptides [ 68 ] or CRISPRs [ 69 , 70 ]; however, they might require significant computational resources or long contiguous sequences. Clearly, subsequent analysis depends on the initial identification of features, and users of annotation pipelines need to be aware of the specific prediction approaches used.
Functional annotation represents a major computational challenge for most metagenomic projects and therefore deserves much attention now and over the coming years. We note that annotation is not done de novo, but via mapping to gene or protein libraries with existing knowledge (i.e., reference databases).
Any sequences that cannot be mapped to the known sequence space are referred to as ORFans. These ORFans are responsible for the seemingly never-ending genetic novelty observed in microbial metagenomics. Three hypotheses exist for the existence of this unknown fraction. First, ORFans might not represent real protein-coding genes at all, but instead might be artifacts of the gene-prediction process.
Second, these ORFans are real genes but encode unknown biochemical functions. Third, ORFan genes have no sequence homology with known genes, but might have structural homology with known proteins, thus representing known protein families or folds.
Future work will likely reveal that the truth lies somewhere between these hypotheses [ 77 ]. Improving the annotation of ORFan genes will rely on the challenging and labor-intensive task of protein structure analysis (e.g., by X-ray crystallography or NMR spectroscopy). Currently, metagenomic annotation relies on classifying sequences to known functions or taxonomic units based on homology searches against available "annotated" data.
Metagenomic datasets are typically very large, so manual annotation is not possible. Automated annotation therefore has to become more accurate and computationally inexpensive. Currently, running a BLASTX similarity search is computationally expensive, costing as much as ten times more than the sequencing itself [ 78 ].
Unfortunately, computationally less demanding methods that rely on detecting feature composition in genes [ 44 ] have limited success with short reads. With growing dataset sizes, faster algorithms are urgently needed, and several programs for similarity searches have been developed to resolve this issue [ 46 , 79 - 81 ]. It is essential that metagenome analysis platforms be able to share data in ways that map and visualize data in the framework of other platforms. These metagenomic exchange languages should also reduce the burden associated with re-processing large datasets, minimizing the redundancy of searching and enabling the sharing of annotations that can be mapped to different ontologies and nomenclatures, thereby allowing multifaceted interpretations.
The Genomic Standards Consortium (GSC), through its M5 project, is providing a prototypical standard for the exchange of computed metagenome analysis results, one cornerstone of these exchange languages. Several large-scale databases are available that process and deposit metagenomic datasets.
Results are expressed in the form of abundance profiles for specific taxa or functional annotations. The MG-RAST web interface allows comparisons using a number of statistical techniques and allows for the incorporation of metadata into the statistics. These developments demonstrate a move by the scientific community to centralize resources and standardize annotation.
Other systems, such as CAMERA [ 74 ], offer more flexible annotation schemas but require that individual researchers understand the annotation of data and analytical pipelines well enough to be confident in their interpretation. Also, for comparisons, all datasets need to be analyzed using the same workflow, which adds computational requirements. The use of dendrograms to display metagenomic data provides a collapsible network of interpretation, which makes analysis of particular functional or taxonomic groups visually easy.
Owing to the high costs, many of the early metagenomic shotgun-sequencing projects were not replicated or were focused on the targeted exploration of specific organisms. Reductions in sequencing cost (see above) and a much wider appreciation of the utility of metagenomics for addressing fundamental questions in microbial ecology now require proper experimental designs with appropriate replication and statistical analysis.
These design and statistical aspects, while obvious, are often not properly implemented in the field of microbial ecology [ 88 ]. However, many suitable approaches and strategies are readily available from decades of research in the quantitative ecology of higher organisms. Metagenomic data can be summarized as tables of the abundance of taxa or gene functions per sample; this is analogous to species-sample matrices in the ecology of higher organisms, and hence many of the statistical tools available to identify correlations and statistically significant patterns are transferable.
As metagenomic data, however, often contain many more species or gene functions than the number of samples taken, appropriate corrections for multiple hypothesis testing have to be implemented (e.g., Bonferroni correction for t-test-based analyses). The Primer-E package [ 89 ] is a well-established tool, allowing for a range of multivariate statistical analyses, including the generation of multidimensional scaling (MDS) plots, analysis of similarities (ANOSIM), and identification of the species or functions that contribute to the difference between two samples (SIMPER).
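As a minimal sketch of such a correction, assuming a hypothetical functions-by-samples abundance table for two groups of replicated metagenomes, the example below runs a t-test per gene function with SciPy and applies a Bonferroni-adjusted significance threshold; the simulated data are only for illustration.

```python
import numpy as np
from scipy import stats

def discriminatory_functions(group_a, group_b, alpha=0.05):
    """group_a, group_b: arrays of shape (n_functions, n_samples) holding the
    relative abundance of each gene function per replicate sample.
    Returns indices of functions significant after Bonferroni correction."""
    n_functions = group_a.shape[0]
    threshold = alpha / n_functions            # Bonferroni-adjusted per-test threshold
    significant = []
    for i in range(n_functions):
        _, p = stats.ttest_ind(group_a[i], group_b[i])
        if p < threshold:
            significant.append(i)
    return significant

# Hypothetical example: 1,000 functions, 5 replicate samples per habitat
rng = np.random.default_rng(0)
a = rng.normal(10, 2, size=(1000, 5))
b = rng.normal(10, 2, size=(1000, 5))
b[42] += 8                                     # one truly different function
print(discriminatory_functions(a, b))          # expected to report index 42
```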
Recently, multivariate statistics were also incorporated into a web-based tool called Metastats [ 90 ], which revealed, with high confidence, discriminatory functions between the replicated metagenomic datasets of the gut microbiota of lean and obese mice [ 91 ]. In addition, the ShotgunFunctionalizeR package provides several statistical procedures for assessing functional differences between samples, both for individual genes and for entire pathways, using the popular R statistical package [ 92 ].
Ideally, and in general, experimental design should be driven by the question asked rather than by technical or operational restrictions. For example, if a project aims to identify unique taxa or functions in a particular habitat, then suitable reference samples for comparison should be taken and processed in a consistent manner. In addition, variation between sample types can be due to true biological variation (which biologists would typically be most interested in) as well as technical variation, and this should be carefully considered when planning the experiment.
One should also be aware that many microbial systems are highly dynamic, so temporal aspects of sampling can have a substantial impact on data analysis and interpretation. While the number of replicates required is often difficult to determine prior to the final statistical analysis, small-scale experiments are often useful for understanding the magnitude of variation inherent in a system.
For example, a small number of samples could be selected and sequenced to shallower depth, then analyzed to determine whether a larger sample size or greater sequencing effort is required to obtain statistically meaningful results [ 88 ].
Also, the level at which replication takes place should be chosen carefully, so that it does not lead to false interpretation of the data. For example, if one is interested in the level of functional variation of the microbial community in habitat A, then multiple samples from this habitat should be taken and processed completely separately, but in the same manner.
Taking just one sample and splitting it up prior to processing will provide information only about technical, but not biological, variation in habitat A. Taking multiple samples and then pooling them will lose all information on variability and hence will be of little use for statistical purposes. Ultimately, good experimental design of metagenomic projects will facilitate integration of datasets into new or existing ecological theories [ 93 ].
As metagenomics gradually moves beyond explorative biodiversity surveys, it will also prove extremely valuable for manipulative experiments. These will allow observation of the impact of treatments on the functional and phylogenetic composition of microbial communities. Initial experiments have already shown promising results [ 94 ]. However, careful experimental planning and interpretation should be paramount in this field. One of the ultimate aims of metagenomics is to link functional and phylogenetic information to the chemical, physical, and other biological parameters that characterize an environment.
While measuring all these parameters can be time-consuming and cost-intensive, doing so allows retrospective correlation analyses of metagenomic data that were perhaps not part of the initial aim of the project or might be of interest for other research questions. The value of such metadata cannot be overstated and, in fact, providing it has become a mandatory or optional component of depositing metagenomic data into some databases [ 50 , 74 ]. Data sharing has a long tradition in the field of genome research, but for metagenomic data this will require a whole new level of organization and collaboration to provide metadata and centralized services (e.g., the annotation and analysis platforms mentioned above).
In order to enable sharing of computed results, some aspects of the various analytical pipelines mentioned above will need to be coordinated - a process currently under way under the auspices of the GSC. Once this has been achieved, researchers will be able to download intermediate and processed results from any one of the major repositories for local analysis or comparison.
A suite of standard languages for metadata is currently provided by the Minimum Information about any (x) Sequence (MIxS) checklists [ 95 ]. The question of centralized versus decentralized storage is also one of "who pays for the storage," which is a matter with no simple answer.
Glossary of terms used in this review:

Metatranscriptomics: Identifying the complete set of transcripts (RNA-seq) from microbial environments.
Multilocus sequence typing (MLST): Technique to detect variability of housekeeping genes for identifying bacterial strains.

Multiplex sequencing: DNA fragments from different samples are pooled and sequenced together.

Operational taxonomic unit (OTU): Sequence-based species cluster, defined for example by 16S rRNA gene sequence similarity.

Orthologous genes: Functionally identical genes in different species that evolved from a common ancestral gene.
Pan-genome: The entire gene set of a species.

Phred score (Q score): Quality value assigned to each sequenced base.

Reads per kilobase per million (RPKM): Normalization used to compare sequencing coverage of genes.

16S ribosomal RNA gene sequences: Marker gene sequences commonly used for taxonomic and phylogenetic classification of bacteria and archaea.