Taxonomic Profiling of Metagenomes

Protocol provided by Anna Sintsova.

Taxonomic profiling of complex microbial communities is an essential first step in investigating the relationship between community composition and environmental and/or health factors. The most common approach to community profiling is amplification and classification of the 16S rRNA gene. Methods related to 16S rRNA analysis are discussed in detail in Amplicon Sequencing. Recently, shotgun metagenomic sequencing has started to replace amplicon-based approaches, as it provides higher-resolution information about the microbial community and resolves some of the biases associated with the 16S approach. A number of software tools have been developed to taxonomically profile metagenomic samples, and these tools have been benchmarked in recent studies. Here we describe the use of mOTUs and mTAGs for taxonomic profiling.

mOTUs

mOTUs determines the composition of metagenomic samples using 10 single-copy phylogenetic marker genes and an extensive database consisting of reference genomes, metagenomes and metagenome-assembled genomes from 23 different environments. Different use cases and applications are discussed in detail in a recent publication and on the mOTUs website. Here we provide a quick reference guide to basic mOTUs functionality.

Note

Please download the sample data and the conda environment file for this section if you want to follow along. See the Tutorials section for instructions on how to unpack the data and create the conda environment. mOTUs installation requires a database download, so expect it to take a little while.

Figure: Taxonomic profiling with mOTUs. Workflow: data preprocessing → profile (mOTUs) → merge (mOTUs).

Data Preprocessing. Before taxonomic profiling, it is important to preprocess the raw sequencing data. Standard preprocessing protocols are described in Data Preprocessing.

Important

In addition to standard quality control and adapter trimming, we also suggest merging of paired-end reads (see Data Preprocessing for more details). Using merged reads increases speed and accuracy.

Profile. Taxonomic profiles for each sample can be generated using the motus profile command. The output profile consists of the identified mOTUs and their abundances.

# Create the output directory, then profile each paired-end sample
mkdir -p motus_profiles
motus profile -f reads/ERR479298_sub1_R1.fq.gz \
     -r reads/ERR479298_sub1_R2.fq.gz \
     -n ERR479298_sub1 -o motus_profiles/ERR479298_sub1.motus -c -k mOTU -q -p
motus profile -f reads/ERR479298_sub2_R1.fq.gz \
     -r reads/ERR479298_sub2_R2.fq.gz \
     -n ERR479298_sub2 -o motus_profiles/ERR479298_sub2.motus -c -k mOTU -q -p

`-f`	input file(s) for reads in forward orientation
`-r`	input file(s) for reads in reverse orientation
`-s`	input file(s) for unpaired reads (singletons or merged paired-end reads)
`-n`	sample name
`-o`	output file name
`-c`	print result as counts instead of relative abundances
`-k`	taxonomic level (kingdom, phylum, class, order, family, genus, mOTU)
`-q`	print the full rank taxonomy
`-p`	print NCBI taxonomy identifiers
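
When there are more than two samples, the per-sample commands above can be generated in a loop. The following is a minimal sketch and not part of mOTUs itself: the helper name `motus_profile_cmds` is hypothetical, and it assumes that read files follow the `<sample>_R1.fq.gz` / `<sample>_R2.fq.gz` naming of the sample data and that paths contain no spaces. It only prints the commands, with the same options as in the example above, so that they can be reviewed before anything is run.

```shell
# Print one `motus profile` command per paired-end sample.
# $1 = directory with the reads, $2 = output directory for the profiles.
# Assumption: files are named <sample>_R1.fq.gz and <sample>_R2.fq.gz.
motus_profile_cmds() {
    local reads_dir=$1 out_dir=$2 r1 sample
    for r1 in "$reads_dir"/*_R1.fq.gz; do
        [ -e "$r1" ] || continue                 # nothing matched the pattern
        sample=$(basename "$r1" _R1.fq.gz)       # e.g. ERR479298_sub1
        printf 'motus profile -f %s -r %s -n %s -o %s -c -k mOTU -q -p\n' \
            "$r1" "${r1%_R1.fq.gz}_R2.fq.gz" "$sample" "$out_dir/$sample.motus"
    done
}

# Print the commands for the tutorial layout (prints nothing if reads/ is absent).
motus_profile_cmds reads motus_profiles
```

Once the printed commands look right, they can be executed by piping the output to `sh`.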

Important

Expect mOTU counts (when run with the -c option) to be relatively small compared to the total number of reads in your sample. The counts are proportional to the library size, and you can expect ~600 mOTU counts for 5,000,000 reads. If you still think you should be getting higher counts, please see the FAQ for common issues.
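
As a quick sanity check, the rule of thumb above can be scaled to your own library size. This is only a rough sketch: it assumes the ~600 counts per 5,000,000 reads figure scales linearly, and the function name `expected_motu_counts` is made up for illustration.

```shell
# Rough expected total of mOTU counts for a library of $1 reads,
# scaling "~600 counts per 5,000,000 reads" linearly (integer arithmetic).
expected_motu_counts() {
    echo $(( $1 * 600 / 5000000 ))
}

expected_motu_counts 20000000   # prints 2400
```

Treat the result as an order-of-magnitude guide rather than a threshold; the actual total depends on the sample.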

Note

The unassigned entry at the end of the profile file represents the fraction of unmapped reads. It corresponds to species that we know to be present in the sample but cannot quantify individually, so they are grouped together into a single unassigned fraction. For almost all analyses it is better to remove this value, since it does not represent a single species/clade. Please see the FAQ for more information.
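
Removing the unassigned entry before downstream analysis can be done with a one-line filter. The sketch below assumes that the profile is tab-separated and that the first column of that row is exactly `unassigned`; please check this against your own output file. The function name `drop_unassigned` is hypothetical.

```shell
# Print a mOTUs profile without its `unassigned` row.
# Assumption: tab-separated file, first column of that row is "unassigned".
# Header/comment lines and all other rows are passed through unchanged.
drop_unassigned() {
    awk -F '\t' '$1 != "unassigned"' "$1"
}

# Example usage (uncomment once the profile exists):
# drop_unassigned motus_profiles/ERR479298_sub1.motus > motus_profiles/ERR479298_sub1.filtered.motus
```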

Merge. Individual taxonomic profiles can be merged using the motus merge command to facilitate downstream analysis.

motus merge -i motus_profiles/ERR479298_sub1.motus,motus_profiles/ERR479298_sub2.motus -o motus_profiles/merged.motus

`-i`	list of mOTU profiles to merge
`-o`	output file name
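
With many samples, typing the comma-separated list for `-i` by hand is error-prone. The following sketch builds that list from a directory of profiles. The helper `profile_list` is hypothetical; it assumes profiles end in `.motus` and skips the merged output file so that re-running the merge does not feed the previous result back in.

```shell
# Build a comma-separated list of per-sample profiles for `motus merge -i`.
# $1 = directory with *.motus profiles, $2 = file name of the merged output to skip.
profile_list() {
    local f
    for f in "$1"/*.motus; do
        [ -e "$f" ] || continue                          # nothing matched the pattern
        if [ "$(basename "$f")" = "$2" ]; then continue; fi
        printf '%s\n' "$f"
    done | paste -sd, -
}

# Example usage (uncomment once the profiles exist):
# motus merge -i "$(profile_list motus_profiles merged.motus)" -o motus_profiles/merged.motus
```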

mTAGs

mTAGs generates taxonomic profiles from short-read metagenomic sequencing data using the small subunit of the ribosomal RNA (SSU-rRNA). The mTAGs tool uses a reference database built by clustering sequences within each genus defined in SILVA 138 into OTUs at 97% identity. Each OTU is represented in the database as a degenerate consensus sequence (generated using the IUPAC DNA code). mTAGs detects sequencing reads belonging to SSU-rRNA and annotates them by aligning them to the consensus reference sequences. For more information about the methods, please see the mTAGs paper.

Figure: Taxonomic profiling with mTAGs. Workflow: data preprocessing → profile (mTAGs) → merge (mTAGs).

Data Preprocessing. As always, it is important to preprocess the raw sequencing data. Standard preprocessing protocols are described in Data Preprocessing. As with mOTUs, we also suggest merging of paired-end reads (see Data Preprocessing for more details).
Download the mTAGs database.

mtags download

Profile. Taxonomic profiles for each sample can be generated using the mtags profile command. The tool produces profiles at 8 different taxonomic levels (root, domain, phylum, class, order, family, genus, and otu). The root level combines all domains; the otu level was generated by clustering sequences within each genus. Each profile has an ‘Unaligned’ and an ‘Unassigned’ entry, which represent sequences that could not be aligned or could not be assigned at a given taxonomic level. These entries need to be taken into account when calculating relative abundances, but should be removed for most downstream analyses.

# Create the output directory, then profile each paired-end sample
mkdir -p mtags_profiles
mtags profile -f reads/ERR479298_sub1_R1.fq.gz \
     -r reads/ERR479298_sub1_R2.fq.gz \
     -n ERR479298_sub1 -o mtags_profiles
mtags profile -f reads/ERR479298_sub2_R1.fq.gz \
     -r reads/ERR479298_sub2_R2.fq.gz \
     -n ERR479298_sub2 -o mtags_profiles

`-f`	input file(s) for reads in forward orientation
`-r`	input file(s) for reads in reverse orientation
`-s`	input file(s) for unpaired reads (singletons or merged paired-end reads)
`-n`	sample name
`-o`	output directory
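
The handling of the ‘Unaligned’ and ‘Unassigned’ entries described above (count them in the total, then drop them) can be sketched as follows. This is an illustration under stated assumptions, not the mTAGs output specification: it assumes a two-column, tab-separated table (taxon, count) in which header lines start with `#` and the two special rows are labelled exactly `Unaligned` and `Unassigned`. The function name `relative_abundance` is hypothetical.

```shell
# Relative abundances from a two-column table (taxon<TAB>count).
# Unaligned/Unassigned contribute to the total but are not printed.
relative_abundance() {
    awk -F '\t' '
        /^#/ { next }                                   # skip header lines
        { taxon[NR] = $1; count[NR] = $2; total += $2 } # remember rows in input order
        END {
            if (total == 0) exit
            for (i = 1; i <= NR; i++) {
                if (!(i in taxon)) continue             # skipped header line
                if (taxon[i] == "Unaligned" || taxon[i] == "Unassigned") continue
                printf "%s\t%.4f\n", taxon[i], count[i] / total
            }
        }' "$1"
}
```

Because the two special rows stay in the denominator, the printed fractions sum to less than 1 whenever some reads were unaligned or unassigned.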

Merge. Individual taxonomic profiles can be merged by running mtags merge on the *.bins files produced by mtags profile.

mtags merge -i mtags_profiles/*bins -o mtags_profiles/merged.mtags

`-i`	list of mTAGs profiles (*.bins files) to merge
`-o`	output file name

Choosing between mOTUs and mTAGs

mOTUs and mTAGs both generate taxonomic profiles from shotgun metagenomic data, but they differ in their approaches. The choice of tool will depend on the specific dataset and the question at hand.

Here are a few considerations to keep in mind:

mTAGs and mOTUs rely on different methodologies for classification. mTAGs uses rRNA sequences clustered at 97% identity, while mOTUs relies on 10 universal single-copy marker genes.
If you would like to compare your data to rRNA-based studies (for example 16S rRNA amplicon), mTAGs would be a better choice.
Since mOTUs does not rely on rRNA genes (unlike mTAGs), it avoids the potential problem of copy number variation.
mTAGs relies on the SILVA database, which in general has better coverage of diversity. The percentage of unprofiled reads is usually much lower in mTAGs than in mOTUs. However, this is highly dependent on the environment being studied.
Very often the resolution of the mOTUs clusters is higher than that of rRNA OTUs. As a consequence, a single 16S sequence can correspond to multiple mOTUs.
The general patterns found in alpha and beta diversity correlate well between these two methods.
mOTUs profiles can provide additional information beyond the taxonomic annotation: ref-mOTUs are directly linked to genomes (through specIs defined in ProGenomes2) and ext-mOTUs are obtained from MAGs. This makes it possible to explore the gene content of the profiled mOTUs, which is not possible for mTAGs profiles because they are defined based on 16S rRNA sequences.

MAPseq

MAPseq is a fast and accurate taxonomic classification tool. Since it relies on rRNA sequences for profiling, it can be applied to both amplicon and metagenomic data.

Important

Workflow coming soon!