Taxonomic Profiling of Metagenomes
Protocol provided by Anna Sintsova.
Taxonomic profiling of complex microbial communities is an essential first step in investigating the relationship between community composition and environmental and/or health factors. The most common approach to community profiling is amplification and classification of the 16S rRNA gene. Methods related to 16S rRNA analysis are discussed in detail in Amplicon Sequencing. Recently, shotgun metagenomic sequencing has started to replace amplicon-based approaches, as it provides higher-resolution information about the microbial community and resolves some of the biases associated with the 16S approach. A number of software tools have been developed to taxonomically profile metagenomic samples, and these tools have been benchmarked in recent studies. Here we describe the use of mOTUs and mTAGs for taxonomic profiling.
mOTUs
mOTUs determines the composition of metagenomic samples using 10 single-copy phylogenetic marker genes and an extensive database consisting of reference genomes, metagenomes and metagenome-assembled genomes from 23 different environments. Different use cases and applications are discussed in detail in a recent publication and on the mOTUs website. Here we provide a quick reference guide to basic mOTUs functionality.
Note
Please download the sample data and the conda environment file for this section if you want to follow along. See the Tutorials section for instructions on how to unpack the data and create the conda environment. mOTUs installation requires a database download, so expect it to take some time.
Data Preprocessing. Before taxonomic profiling, it is important to preprocess the raw sequencing data. Standard preprocessing protocols are described in Data Preprocessing.
Important
In addition to standard quality control and adapter trimming, we also suggest merging paired-end reads (see Data Preprocessing for more details). Using merged reads increases speed and accuracy.
Profile. Taxonomic profiles for each sample can be generated using the motus profile command. The output profile consists of the identified mOTUs and their abundances.
mkdir motus_profiles
motus profile -f reads/ERR479298_sub1_R1.fq.gz \
-r reads/ERR479298_sub1_R2.fq.gz \
-n ERR479298_sub1 -o motus_profiles/ERR479298_sub1.motus -c -k mOTU -q -p
motus profile -f reads/ERR479298_sub2_R1.fq.gz \
-r reads/ERR479298_sub2_R2.fq.gz \
-n ERR479298_sub2 -o motus_profiles/ERR479298_sub2.motus -c -k mOTU -q -p
| -f | input file(s) for reads in forward orientation |
| -r | input file(s) for reads in reverse orientation |
| -s | input file(s) for unpaired reads (singletons or merged paired-end reads) |
| -n | sample name |
| -o | output file name |
| -c | print result as counts instead of relative abundances |
| -k | taxonomic level (kingdom, phylum, class, order, family, genus, mOTU) |
| -q | print the full rank taxonomy |
| -p | print NCBI taxonomy identifiers |
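When there are more than a few samples, the same profile command can be applied in a loop. The following is a minimal sketch, assuming the paired files follow the `<sample>_R1.fq.gz` / `<sample>_R2.fq.gz` naming of the sample data. The `profile_all` function and the `RUN` variable are hypothetical helpers introduced here for illustration; they are not part of mOTUs. By default `RUN=echo` only prints the commands that would be run.

```shell
# Sketch: build one "motus profile" call per paired-end sample.
# Assumption: reads are named <sample>_R1.fq.gz and <sample>_R2.fq.gz.
# RUN is a hypothetical switch: "echo" prints the commands (dry run),
# an empty value executes them.
RUN="${RUN:-echo}"

profile_all() {
    reads_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    for fwd in "$reads_dir"/*_R1.fq.gz; do
        [ -e "$fwd" ] || continue              # no forward files found
        sample=$(basename "$fwd" _R1.fq.gz)
        rev="$reads_dir/${sample}_R2.fq.gz"
        if [ ! -e "$rev" ]; then
            echo "skipping $sample: missing $rev" >&2
            continue
        fi
        # Same options as the single-sample commands in this section.
        $RUN motus profile -f "$fwd" -r "$rev" -n "$sample" \
            -o "$out_dir/$sample.motus" -c -k mOTU -q -p
    done
}

profile_all reads motus_profiles
```

Once the printed commands look right, running the script with `RUN=` (empty) set in the environment executes them instead of printing them.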
Important
Expect mOTU counts (when run with the -c option) to be relatively small compared to the total number of reads in your sample. The counts are proportional to the library size, and you can expect ~600 mOTU counts for 5,000,000 reads. If you still think you should be getting higher counts, please see the FAQ for common issues.
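As a rough sanity check, this rule of thumb can be turned into a small calculation. The sketch below assumes simple linear scaling from the ~600 counts per 5,000,000 reads quoted above, and that the input is gzip-compressed FASTQ with four lines per read. Both function names are hypothetical, and the result is only an order-of-magnitude guide.

```shell
# Sketch: estimate the expected number of mOTU counts for a library.
# Assumption: linear scaling of ~600 counts per 5,000,000 reads.

count_reads() {
    # Number of reads in a gzip-compressed FASTQ file (4 lines per read).
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

expected_motu_counts() {
    # $1 = number of reads in the library
    awk -v n="$1" 'BEGIN { printf "%d\n", n * 600 / 5000000 }'
}

expected_motu_counts 20000000   # prints 2400

# Example on a real file (not executed here):
# expected_motu_counts "$(count_reads reads/ERR479298_sub1_R1.fq.gz)"
```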
Note
The unassigned entry at the end of the profile file represents the fraction of unmapped reads. It accounts for species that we know to be present in the sample but cannot quantify individually; hence we group them together into a single unassigned fraction. For almost all analyses, it is better to remove this value, since it does not represent a single species or clade. Please see the FAQ for more information.
Merge. Individual taxonomic profiles can be merged using the motus merge command to facilitate downstream analysis.
motus merge -i motus_profiles/ERR479298_sub1.motus,motus_profiles/ERR479298_sub2.motus -o motus_profiles/merged.motus
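Removing the unassigned entry before downstream analysis can be done with a one-line filter. This is a minimal sketch, assuming the profile is tab-separated and that the entry appears as a row whose first column is the literal text `unassigned`; check your own output file before relying on it. The `drop_unassigned` function name is hypothetical.

```shell
# Sketch: drop the "unassigned" row from a mOTUs profile.
# Assumption: tab-separated file, first column equals "unassigned"
# for that row. Header and comment lines are kept unchanged.
drop_unassigned() {
    awk -F '\t' '$1 != "unassigned"' "$1"
}

# Example (not executed here):
# drop_unassigned motus_profiles/ERR479298_sub1.motus > ERR479298_sub1.filtered.motus
```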
| -i | list of mOTU profiles to merge |
| -o | output file name |
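With many samples, the comma-separated list for -i can be built from the files in the profile directory instead of being typed by hand. The sketch below assumes all profiles end in `.motus` and skips `merged.motus`, since the merged output above is written to the same directory. The `profile_list` function is a hypothetical helper, and the final line only prints the merge command.

```shell
# Sketch: build the comma-separated -i argument for "motus merge".
# Assumption: every profile in the directory ends in .motus;
# a previous merged.motus in the same directory is skipped.
profile_list() {
    list=""
    for f in "$1"/*.motus; do
        [ -e "$f" ] || continue
        [ "$(basename "$f")" = "merged.motus" ] && continue
        list="${list:+$list,}$f"
    done
    printf '%s\n' "$list"
}

# Dry run: prints the command; remove "echo" to execute it.
echo motus merge -i "$(profile_list motus_profiles)" -o motus_profiles/merged.motus
```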
MAPseq
MAPseq is a fast and accurate taxonomic classification tool. Since it relies on rRNA sequences for profiling, it can be applied to both amplicon and metagenomic data.
Important
Workflow coming soon!