Data Preprocessing
General Considerations
Data quality control is an essential first step in any bioinformatics workflow. Below we discuss recommended preprocessing steps for short read Illumina sequencing data. Broadly, these steps involve Illumina adapter removal, contaminant filtering and quality-trimming. Additional preprocessing steps, recommended only for specific workflows, are detailed in Other Considerations.
Important
This applies to (standard) Illumina short-read data. Long-read sequencing data from other technologies, or other Illumina library preparations (e.g. Nextera Mate Pair data), will require a different preprocessing protocol.
Note
Sample data for this section can be found here. The conda environment specifications are here. See the Tutorials section for instructions on how to unpack the data and create the conda environment. After unpacking the data, you should have a set of forward (Sample1_R1.fq.gz) and reverse (Sample1_R2.fq.gz) reads. Also included are Illumina adapter sequences (adapters.fa) and the PhiX genome (phix174_ill.ref.fa.gz).
Adapter Trimming. The adapter sequences contain the sequencing primer binding sites, index sequences, and sequences that allow flow-cell binding. Unless removed, these can interfere with downstream analyses. For this and the other preprocessing steps, we use BBTools, a suite of tools developed by the Joint Genome Institute. Adapter trimming is performed with BBDuk. In this step, a FASTA file with Illumina adapter sequences is supplied as the reference, and BBDuk performs k-mer matching to trim the adapter sequences from the reads. The example command is shown below.
Example command
bbduk.sh -Xmx1G usejni=t in=Sample1_R1.fq.gz in2=Sample1_R2.fq.gz \
    out=Sample1_trimmed_R1.fq.gz out2=Sample1_trimmed_R2.fq.gz \
    outm=Sample1_adapter_matched.fq.gz outs=Sample1_adapter_s.fq.gz \
    refstats=Sample1.adapter_trim.stats statscolumns=5 overwrite=t ref=adapters.fa \
    ktrim=r k=23 mink=11 hdist=1 2>> preprocessing.log
Options Explained

| Option | Description |
|---|---|
| -Xmx1G | This will be passed to Java to set memory usage. |
| usejni=t | Enable the JNI-accelerated version of BBDuk. |
| ktrim=r | Trim bases to the right of reference k-mer matches, where the adapter sequence lies. |
| k=23 | Length of the k-mer used for matching. |
| mink=11 | Additionally match shorter k-mers (with lengths between 23 and 11) to trim partial adapter sequences. |
| hdist=1 | Hamming distance allowed for reference k-mers. The Hamming distance is the number of bases by which two equal-length DNA sequences differ. |
| outs= | Write singleton reads whose mate has failed filters to this file. |
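The core of BBDuk's adapter detection is k-mer matching with a mismatch tolerance, controlled by k and hdist above. The Python sketch below illustrates the idea on a toy scale; hamming and find_adapter_kmer are illustrative helpers (not BBTools code), and the sequences are made up.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def find_adapter_kmer(read: str, adapter: str, k: int = 11, hdist: int = 1):
    """Return the first position in `read` whose k-mer lies within `hdist`
    mismatches of any k-mer from `adapter`, or None if there is no match.
    Illustrative helper, not BBTools code."""
    ref_kmers = {adapter[i:i + k] for i in range(len(adapter) - k + 1)}
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        if any(hamming(window, ref) <= hdist for ref in ref_kmers):
            return i
    return None

# Toy read: 12 bp of insert followed by the start of an adapter sequence.
read = "ACGTACGTACGT" + "AGATCGGAAGAGC"
print(find_adapter_kmer(read, "AGATCGGAAGAGC"))  # → 12 (trim from here rightwards)
```

A real run uses longer k-mers (k=23 down to mink=11) so that spurious matches to genomic sequence remain unlikely.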
Contaminant removal. Spike-ins (most commonly PhiX) are routinely used for quality control of sequencing runs, as well as to ensure nucleotide diversity when sequencing low-complexity libraries. We perform this filtering step prior to downstream analysis to make sure that these sequences are not present in the data. Here we again use BBDuk, this time with the PhiX genome as the reference.
Example command
bbduk.sh -Xmx1G usejni=t in=Sample1_trimmed_R1.fq.gz in2=Sample1_trimmed_R2.fq.gz \
    out=Sample1_phix_removed_R1.fq.gz out2=Sample1_phix_removed_R2.fq.gz \
    outm=Sample1_phix_matched.fq.gz outs=Sample1_phix_s.fq.gz \
    ref=phix174_ill.ref.fa.gz k=31 hdist=1 \
    refstats=Sample1_phix.stats statscolumns=5 2>> contaminant.log
Here, we use a different k-mer size, k=31, to remove spike-ins. This is the length recommended by the BBDuk documentation to remove all reads that have a 31-mer match to the PhiX genome.
Note
High nucleotide diversity (i.e. equal relative proportions of A,C,G, and T in each cycle) is critical to the performance of Illumina sequencers. Low diversity (or low complexity) libraries, such as amplicon libraries, will have a large proportion of one nucleotide and small proportions of other nucleotides in a cycle. To compensate for low complexity, a PhiX DNA sequence is often added to the library. Different sequencers use different chemistry and image processing software and require different amounts of PhiX spike-in (anywhere between 5% and 50%). Check the latest information about your sequencing platform.
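As a rough illustration of what per-cycle diversity means, the hypothetical helper below tallies the base composition at each cycle across a toy set of reads; in a low-complexity library, the first cycles are dominated by a single base.

```python
from collections import Counter

def per_cycle_base_fractions(reads):
    """Fraction of each base observed at every cycle (read position) across
    a set of reads. Illustrative helper, not part of any QC tool."""
    length = min(len(r) for r in reads)
    fractions = []
    for cycle in range(length):
        counts = Counter(r[cycle] for r in reads)
        total = sum(counts.values())
        fractions.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return fractions

# Amplicon-like toy library: every read begins with the same primer "ACGT",
# so cycles 1-4 each see a single base (zero diversity); later cycles are mixed.
reads = ["ACGTTGCA", "ACGTAGGT", "ACGTCCAT", "ACGTGTTA"]
print(per_cycle_base_fractions(reads)[0])  # {'A': 1.0, 'C': 0.0, 'G': 0.0, 'T': 0.0}
```

Adding PhiX reads to such a library moves each cycle's composition closer to equal proportions.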
Quality filtering and trimming. In this step we use BBDuk to trim low quality bases from the ends of the reads and filter reads based on length, average read quality, and number of Ns present.
Example command
bbduk.sh -Xmx1G usejni=t in=Sample1_phix_removed_R1.fq.gz in2=Sample1_phix_removed_R2.fq.gz \
    out1=Sample1_clean_R1.fq.gz out2=Sample1_clean_R2.fq.gz \
    outm=Sample1_qc_failed.fq.gz outs=Sample1_s.fq.gz minlength=45 \
    qtrim=rl maq=20 maxns=1 stats=Sample1_qc.stats statscolumns=5 trimq=14 2>> qc.log
Options Explained

| Option | Description |
|---|---|
| minlength=45 | Filter out reads that are shorter than 45 bp. |
| qtrim=rl | Trim low-quality bases from both the right and left ends of the reads. |
| trimq=14 | Regions with average quality below 14 will be trimmed. |
| maq=20 | Filter out reads with average quality below 20. |
| maxns=1 | Filter out reads containing more than 1 N. |
Note
Base quality scores (i.e. the level of confidence in each base call) are an integral part of many bioinformatics pipelines (e.g. alignment and variant calling). Quality scores are usually expressed on a Phred scale (\(Q = -10\log_{10}P\), where P is the probability of an error in the base call). Base quality scores normally range between 2 and 40, where Q40 represents an error probability of 1/10000. More recently, Illumina started using binned quality scores. For example, NovaSeq (with RTA3) only produces 4 Q-scores: 2 is assigned to no-calls, 12 to calls below Q15, 23 to calls around Q20, and 37 to calls above Q30. According to Illumina, and in our hands, these binned quality scores do not affect downstream analyses (e.g. variant calling).
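The Phred relationship can be checked directly. The small helpers below (illustrative, not from any pipeline) convert a Q-score to its error probability and decode a Sanger-encoded (offset-33) FASTQ quality string:

```python
def phred_to_error_prob(q: int) -> float:
    """Invert Q = -10 * log10(P) to get the base-call error probability."""
    return 10 ** (-q / 10)

def decode_quality_string(qual: str, offset: int = 33):
    """Decode a FASTQ quality string; offset 33 is the Sanger encoding
    (the qin=33 setting used in the mapping commands below)."""
    return [ord(c) - offset for c in qual]

print(phred_to_error_prob(40))       # 0.0001, i.e. 1 error in 10000 calls
print(decode_quality_string("I#?"))  # [40, 2, 30]
```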
All of the preprocessing commands can be piped together as follows:
bbduk.sh -Xmx1G usejni=t in=Sample1_R1.fq.gz in2=Sample1_R2.fq.gz \
out=stdout.fq outm=Sample1_adapter_matched.fq.gz outs=Sample1_adapter_s.fq.gz \
refstats=Sample1.adapter_trim.stats statscolumns=5 overwrite=t ref=adapters.fa \
ktrim=r k=23 mink=11 hdist=1 2>> preprocessing.log | \
bbduk.sh -Xmx1G usejni=t interleaved=true overwrite=t \
in=stdin.fq out=stdout.fq outm=Sample1_phix_matched.fq.gz outs=Sample1_phix_s.fq.gz \
ref=phix174_ill.ref.fa.gz k=31 hdist=1 refstats=Sample1_phix.stats statscolumns=5 2>> preprocessing.log | \
bbduk.sh -Xmx1G usejni=t overwrite=t interleaved=true \
in=stdin.fq out1=Sample1_clean_R1.fq.gz out2=Sample1_clean_R2.fq.gz \
outm=Sample1_qc_failed.fq.gz outs=Sample1_s.fq.gz minlength=45 \
qtrim=rl maq=20 maxns=1 stats=Sample1_qc.stats statscolumns=5 trimq=14 2>> preprocessing.log;
Other Considerations
Below are additional preprocessing steps that are recommended only for specific applications. All of these steps are performed on the clean reads produced by the general preprocessing workflow outlined above.
| Preprocessing Step | Recommended for | Tools |
|---|---|---|
| Filtering out host reads | Any samples containing host DNA | BBMap |
| Coverage normalization | Metagenomic assembly (very large samples only) | BBNorm |
| Paired-read merging | Metagenomic assembly, 16S and mOTUs profiling | BBMerge |
Filtering out host reads
Samples containing host DNA can be filtered by mapping the reads against the host genome. This step is performed using the BBMap aligner; note that the index referenced by path= must be built beforehand from the host genome FASTA (BBMap's ref= option).
Note
Host genome sequences are not provided in the test dataset, but can be downloaded from NCBI, Ensembl, or UCSC. Be sure to keep track of the genome version you are using. Genomes for commonly analyzed organisms can also be downloaded from Illumina iGenomes.
Example Command
bbmap.sh -Xmx23g usejni=t threads=20 overwrite=t qin=33 minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast \
    minhits=2 path=host_bbmap_ref qtrim=rl trimq=15 untrim in1=in.1.fq.gz in2=in.2.fq.gz outu1=out.1.fq.gz \
    outu2=out.2.fq.gz outm=out.host.matched.fq.gz 2>> removeHost.log

This step has to be repeated for the singleton reads generated in the QC step:
bbmap.sh -Xmx23g usejni=t threads=24 overwrite=t qin=33 minid=0.95 maxindel=3 \
    bwr=0.16 bw=12 quickmatch fast minhits=2 \
    path=host_bbmap_ref qtrim=rl trimq=15 untrim in=in.s.fq.gz outu=out.s.fq.gz \
    outm=out.s.host.matched.fq.gz 2>> out.rmHost.log
| Option | Description |
|---|---|
| qin=33 | Input quality ASCII offset: 33 (Sanger) or 64 (old Solexa). Can be left unspecified (default=auto). |
| minid=0.95 | Approximate minimum alignment identity to look for. |
| maxindel=3 | Don't look for indels longer than this. Lower is faster. |
| bwr=0.16 | If above zero, restrict the alignment band to this fraction of read length. Faster but less accurate. |
| bw=12 | Set the bandwidth directly. |
| quickmatch | Generate CIGAR strings more quickly. |
| fast | Set other parameters to run faster, at reduced sensitivity. |
| minhits=2 | Minimum number of seed hits required for candidate sites. |
| path= | Location to write the index. |
| qtrim=rl | Quality-trim both ends of the reads before mapping. |
| trimq=15 | Trim regions with average quality below this value. |
| untrim | Undo the trimming after mapping. |
| in1=, in2= | Primary reads input. |
| outu1=, outu2= | Write only unmapped reads to these files. |
| outm= | Write only mapped reads (i.e. reads matching the host genome) to this file. |
Normalization
This step normalizes coverage by down-sampling reads in high-coverage regions. It is only necessary for very large metagenomic samples, where it makes the assembly computationally tractable. An example using BBNorm is shown below. As above, this step needs to be repeated for the singletons.
Example Command
bbnorm.sh -Xmx{memory_limit}G threads={threads} extra=s.fq.gz in1=r1.fq.gz \
    in2=r2.fq.gz out1=output_1.fq.gz out2=output_2.fq.gz target=40 mindepth=0 hist=output.hist \
    peaks=output.peaks &> pe_norm.log; \
bbnorm.sh -Xmx{memory_limit}G threads={threads} extra=r1.fq.gz,r2.fq.gz \
    in=s.fq.gz out=output_s.fq.gz target=40 mindepth=0 hist=output.hist2 \
    peaks=output.peaks2 &> s_norm.log
| Option | Description |
|---|---|
| -Xmx | This will be passed to Java to set memory usage. |
| threads= | Number of threads to use. |
| extra= | Additional files to use for building the k-mer table, but not for output. |
| in1= | Path to the forward reads. |
| in2= | Path to the reverse reads. |
| out1= | Normalized forward reads. |
| out2= | Normalized reverse reads. |
| target=40 | Target normalization depth. |
| mindepth=0 | K-mers with depth below this value will not be included when calculating the depth of a read. |
| hist= | File to write the input k-mer depth histogram to. |
| peaks= | File to write the k-mer depth peaks to. |
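Conceptually, coverage normalization keeps each read with probability proportional to the target depth divided by the read's estimated depth. The hypothetical normalize function below sketches only that keep/discard decision: it takes per-read depths as given, whereas BBNorm estimates them from k-mer counts.

```python
import random

def normalize(read_depths, target=40, seed=0):
    """Keep each read with probability min(1, target / depth), so regions
    above `target` coverage are down-sampled and low-coverage regions are
    left untouched. Simplified sketch only, not BBNorm's actual algorithm."""
    rng = random.Random(seed)
    return [i for i, depth in enumerate(read_depths)
            if rng.random() < min(1.0, target / depth)]

# 1000 reads from a 400x region plus 100 reads from a 20x region:
depths = [400] * 1000 + [20] * 100
kept = normalize(depths, target=40)
# High-coverage reads are kept at ~10% (40/400); all low-coverage reads survive.
```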
Paired-read Merging
Merging refers to combining two overlapping paired-end reads into a single read. This is recommended for amplicon data, mOTUs profiling, and metagenomic assembly. We do not usually merge reads for isolate genome assembly. Merging can be done using BBMerge.
Example Command
bbmerge.sh -Xmx32G threads=32 in1=Sample1_R1.fq.gz in2=Sample1_R2.fq.gz out=Sample1.m.fq.gz \
    outu1=Sample1.merge.R1.fq.gz outu2=Sample1.merge.R2.fq.gz minoverlap=16 usejni=t \
    ihist=Sample1.merge.hist &> merge.log
| Option | Description |
|---|---|
| -Xmx32G | This will be passed to Java to set memory usage. |
| threads= | Number of threads to use. |
| in1= | Path to the forward reads. |
| in2= | Path to the reverse reads. |
| out= | File for merged reads. |
| outu1= | File for forward unmerged reads. |
| outu2= | File for reverse unmerged reads. |
| minoverlap=16 | Minimum number of overlapping bases required to merge a pair. |
| ihist= | Insert-size histogram output file. |
| usejni=t | Perform the overlap detection in C code, which is faster. Requires compiling the bundled C code. |
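The idea behind overlap merging can be sketched as follows: reverse-complement the reverse read and search for a sufficiently long overlap with the end of the forward read. The toy merge_pair helper below (illustrative only) accepts exact overlaps exclusively, whereas BBMerge scores overlaps using base qualities and tolerates mismatches.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def merge_pair(r1: str, r2: str, minoverlap: int = 16):
    """Toy merger: reverse-complement r2, find the longest exact overlap of
    at least `minoverlap` bases between the end of r1 and the start of
    revcomp(r2), and join the two reads. Returns None if no overlap is found."""
    r2rc = revcomp(r2)
    for olen in range(min(len(r1), len(r2rc)), minoverlap - 1, -1):
        if r1[-olen:] == r2rc[:olen]:
            return r1 + r2rc[olen:]
    return None

# A 24 bp fragment sequenced from both ends with 20 bp reads:
frag = "AAACCCGGGTTTACGTACGTACGT"
r1 = frag[:20]            # forward read
r2 = revcomp(frag[4:])    # reverse read (opposite strand)
print(merge_pair(r1, r2) == frag)  # True: the pair reassembles the fragment
```

This also shows why merging only helps when the insert size is shorter than the combined read length; non-overlapping pairs are written to outu1/outu2 unchanged.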