Data Preprocessing

Protocol provided by Hans-Joachim Ruscheweyh.

General Considerations

Data quality control is an essential first step in any bioinformatics workflow. Below we discuss recommended preprocessing steps for short read Illumina sequencing data. Broadly, these steps involve Illumina adapter removal, contaminant filtering and quality-trimming. Additional preprocessing steps, recommended only for specific workflows, are detailed in Other Considerations.

Important

This applies to (standard) Illumina short read data. Long read sequencing data from other technologies, or other library preparations from Illumina (e.g. Nextera Mate Pair Reads data) will require a different preprocessing protocol.

Note

Sample data for this section can be found here. The conda environment specifications are here. See the Tutorials section for instructions on how to unpack the data and create the conda environment. After unpacking the data, you should have a set of forward (Sample1_R1.fq.gz) and reverse (Sample1_R2.fq.gz) reads. Also included are Illumina adapter sequences (adapters.fa) and PhiX genome (phix174_ill.ref.fa.gz).

flowchart LR id1( Preprocessing) --> id2(adapter<br/>trimming<br/>fa:fa-cog BBTools BBDuk) id2 --> id3(contaminant<br/>filtering<br/>fa:fa-cog BBTools BBDuk) id3 --> id4(quality filtering/<br/>trimming<br/>fa:fa-cog BBTools BBDuk) classDef tool fill:#96D2E7,stroke:#F8F7F7,stroke-width:1px; style id1 fill:#5A729A,stroke:#F8F7F7,stroke-width:1px,color:#fff class id2,id3,id4 tool
  1. Adapter Trimming. The adapter sequences contain the sequencing primer binding sites, index sequences, and sequences that allow flow-cell binding. Unless removed, these can interfere with downstream analyses. For this and other preprocessing steps, we use BBTools, a set of tools developed by the Joint Genome Institute. Adapter trimming is performed using BBDuk. In this step, a FASTA file with Illumina adapter sequences is specified as reference, and BBDuk will perform k-mer matching to trim the adapter sequences from the reads. The example command is shown below.

    Example command
    bbduk.sh -Xmx1G usejni=t in=Sample1_R1.fq.gz in2=Sample1_R2.fq.gz \
    out=Sample1_trimmed_R1.fq.gz out2=Sample1_trimmed_R2.fq.gz \
    outm=Sample1_adapter_matched.fq.gz outs=Sample1_adapter_s.fq.gz  \
    refstats=Sample1.adapter_trim.stats statscolumns=5 overwrite=t ref=adapters.fa \
    ktrim=r k=23 mink=11 hdist=1 2>> preprocessing.log
    

Options Explained

-Xmx

This will be passed to Java to set memory usage. Xmx1G will set it to 1G.

usejni

Enable JNI-accelerated version of BBDuk.

ktrim

ktrim = r trims the adapter as well as all the bases to the right of the adapter sequence.

k

Length of the k-mer used for matching.

mink

Additionally matches shorter k-mers (with lengths between 23 and 11) to trim partial adapter sequences.

hdist

Hamming distance for reference k-mers. The Hamming distance describes the number of bases by which two DNA sequences differ.

outs

Write singleton reads whose mate has failed filters to this file.

  1. Contaminant removal. Spike-ins (most commonly PhiX) are usually used for quality control of sequencing runs as well as to ensure nucleotide diversity when sequencing low complexity libraries. We perform this filtering step prior to downstream analysis to be completely sure that these sequences are not be present in your data. Here we use BBDuk and PhiX genome is used as the reference.

    Example command

    bbduk.sh -Xmx1G usejni=t in=Sample1_trimmed_R1.fq.gz in2=Sample1_trimmed_R2.fq.gz \
    out=Sample1_phix_removed_R1.fq.gz out2=Sample1_phix_removed_R2.fq.gz \
    outm=Sample1_phix_matched.fq.gz outs=Sample1_phix_s.fq.gz \
    ref=phix174_ill.ref.fa.gz k=31 hdist=1 \
    refstats=Sample1_phix.stats statscolumns=5 2>> contaminant.log
    

Here, we use a different kmer size k=31 to remove Spike-ins. This is the recommended length by BBDuk to remove all reads that have a 31-mer match to the PhiX genome.

Note

High nucleotide diversity (i.e. equal relative proportions of A,C,G, and T in each cycle) is critical to the performance of Illumina sequencers. Low diversity (or low complexity) libraries, such as amplicon libraries, will have a large proportion of one nucleotide and small proportions of other nucleotides in a cycle. To compensate for low complexity, a PhiX DNA sequence is often added to the library. Different sequencers use different chemistry and image processing software and require different amounts of PhiX spike-in (anywhere between 5% and 50%). Check the latest information about your sequencing platform.

  1. Quality filtering and trimming. In this step we use BBDuk to trim low quality bases from the ends of the reads and filter reads based on length, average read quality, and number of Ns present.

    Example command

    bbduk.sh -Xmx1G usejni=t in=Sample1_phix_removed_R1.fq.gz in2=Sample1_phix_removed_R2.fq.gz  \
    out1=Sample1_clean_R1.fq.gz out2=Sample1_clean_R2.fq.gz \
    outm=Sample1_qc_failed.fq.gz outs=Sample1_s.fq.gz minlength=45 \
    qtrim=rl maq=20 maxns=1  stats=Sample1_qc.stats statscolumns=5 trimq=14 2>> qc.log
    

Options Explained

minlength=45

Filters out reads that are shorter than 45 bp.

qtrim=rl

Trims low quality bases on the right and left ends of the reads.

trimq=14

Regions with average quality BELOW 14 will be trimmed.

maq=20

Filters out reads with average quality BELOW 20.

maxns=1

Filters out reads with more than 1 N.

Note

Base quality scores (i.e. level of confidence for any one base call) are an integral part of many bioinformatics pipelines (i.e. alignment and variant calling). Quality scores are usually expressed on a Phred scale (\(Q=-10log_{10}P\), where P is the probability of an error in the base call). Base quality scores normally range somewhere between 2 and 40, where Q40 represents an error probability of 1/10000. More recently, Illumina started using binned quality scores. For example, NovaSeq (with RTA3) only produces 4 Q-scores: 2 is assigned to no-calls, 12 to calls <Q15, 23 to ~Q20 and 37 to >Q30. According to Illumina and in our hands, these binned quality scores did not affect the downstream analyses (i.e. variant calling).

All of the preprocessing commands can be piped together as follows:

bbduk.sh -Xmx1G usejni=t in=Sample1_R1.fq.gz in2=Sample1_R2.fq.gz \
out=stdout.fq outm=Sample1_adapter_matched.fq.gz outs=Sample1_adapter_s.fq.gz  \
refstats=Sample1.adapter_trim.stats statscolumns=5 overwrite=t ref=adapters.fa \
ktrim=r k=23 mink=11 hdist=1  2>> preprocessing.log | \
bbduk.sh -Xmx1G usejni=t interleaved=true overwrite=t \
in=stdin.fq out=stdout.fq outm=Sample1_phix_matched.fq.gz outs=Sample1_phix_s.fq.gz \
ref=phix174_ill.ref.fa.gz k=31 hdist=1 refstats=Sample1_phix.stats statscolumns=5 2>> preprocessing.log | \
bbduk.sh -Xmx1G usejni=t overwrite=t interleaved=true \
in=stdin.fq out1=Sample1_clean_R1.fq.gz out2=Sample1_clean_R2.fq.gz \
outm=Sample1_qc_failed.fq.gz outs=Sample1_s.fq.gz minlength=45 \
qtrim=rl maq=20 maxns=1  stats=Sample1_qc.stats statscolumns=5 trimq=14  2>> preprocessing.log;

Other Considerations

Below are some of the other preprocessing steps that are recommended for specific applications only. All of these steps will be performed on the clean reads produced by general preprocessing workflow outlined above.

Preprocessing Step

Recommended for

Tools

Filtering out host reads

Any samples containing host DNA

BBMap

Coverage normalization

Metagenomic assembly (very large samples only)

BBNorm

Paired-read merging

Metagenomic assembly, 16S and mOTUs profiling

BBMerge

Filtering out host reads

Samples containing host DNA can be filtered by mapping the reads to the host genome. This step is performed using BBMap aligner.

Note

As described in this post, simply mapping reads to host genome, might lead to false positives, i.e. reads that are bacterial in origin, but nevertheless mapped to host genome. Removal of these reads might have a negative effect on the quality of the assemblies. In case of human host, this can be avoided by using this masked genome. The masking procedure is described in the post linked above. However, this is not available for other host genomes. Unmasked references can be downloaded from NCBI, Ensembl, UCSC. Be sure to keep track of the genome version you are using. Genomes for commonly analyzed organisms can also be downloaded from Illumina iGenomes.

Example Command

bbmap.sh -Xmx23g usejni=t threads=20 overwrite=t qin=33 minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast \
minhits=2 path=host_bbmap_ref qtrim=rl trimq=15 untrim in1=in.1.fq.gz in2=in.2.fq.gz outu1=out.1.fq.gz \
outu2=out.2.fq.gz outm=out.host.matched.fq.gz 2>> removeHost.log

This step has to be repeated for singleton sequences generated in the QC step:

bbmap.sh -Xmx23g usejni=t threads=24 overwrite=t qin=33 minid=0.95 maxindel=3 \
bwr=0.16 bw=12 quickmatch fast    minhits=2 \
path=host_bbmap_ref qtrim=rl trimq=15 untrim in=in.s.fq.gz outu=out.s.fq.gz \
outm=out.s.host.matched.fq.gz 2>> out.rmHost.log

qin

Set to 33 or 64 to specify input quality value ASCII offset. 33 is Sanger, 64 is old Solexa. Could be left unspecified (default=auto).

minid

Approximate minimum alignment identity to look for.

maxindel

Don’t look for indels longer than this. Lower is faster.

bwr

If above zero, restrict alignment band to this fraction of read length. Faster but less accurate.

bw

Set the bandwidth directly.

qickmatch

Generate cigar strings more quickly.

fast

Sets other paramters to run faster, at reduced sensitivity.

minhits

Minimum number of seed hits required for candidate sites.

path

Specify the location to write the index.

qtrim

Quality-trim ends before mapping.

trimq

Trim regions with average quality below this.

untrim

Undo trimming after mapping.

in

Primary reads input.

outu

Write only unmapped reads to this file.

outm

Write only mapped reads, that fail filters to this file.

Important

This command will NOT remove all of the host sequences from your sample. The main puropose of the host removal as described here, is to improve metagenome assemblies, not to eliminate all of the host sequences, i.e. if you’re working with humand data, some human reads might still be present in your samples.

Normalization

This step normalizes the coverage by down-sampling reads over high-coverage areas. This step is only necessary for very large metagenomic samples in order to make the assembly computationally tractable. An example using BBNorm is shown below. As above this step needs to be repeated for the singletons.

Example Command

bbnorm.sh -Xmx{memory_limit}G threads={threads} extra=s.fq.gz in1=r1.fq.gz \
in2=r2.fq.gz out1=output_1.fq.gz out2=output_2.fq.gz target=40 mindepth=0 hist=output.hist \
peaks=output.peaks &> pe_norm.log; \

bbnorm.sh -Xmx{memory_limit}G threads={threads} extra=r1.fq.gz,r2.fq.gz \
in=s.fq.gz out=output_s.fq.gz target=40 mindepth=0 hist=output.hist2 \
peaks=output.peaks2 &> s_norm.log

-Xmx

This will be passed to Java to set memory usage.

threads

Set to number of threads desired.

extra

For the kmer table: Additional files to use for input, but not for output.

in1

Path to the forward reads.

in2

Path to the reverse reads.

out1

Normalized forward reads.

out2

Normalized reverse reads.

target

Target normalization depth.

mindepth

Kmers with depth below this number will not be included when calculating the depth of a read.

hist

Specify a file to write the input kmer depth histogram.

peaks

Write the peaks to this file.

Pair-read Merging

Merging refers to merging two overlapping reads into one. This is recommended for amplicon data, mOTUs profiling and metagenomic assembly. We do not usually merge the reads for isolate genome assembly. This can be done using BBMerge.

Example Command
bbmerge.sh -Xmx32G threads=32 in1=Sample1_R1.fq.gz in2=Sample1_R2.fq.gz out=Sample1.m.fq.gz \
outu1=Sample1.merge.R1.fq.gz outu2=Sample1.merge.R2.fq.gz minoverlap=16 usejni=t \
ihist=Sample1.merge.hist &> merge.log

-Xmx

This will be passed to Java to set memory usage.

threads

Set to number of threads desired.

in1

Path to the forward reads.

in2

Path to the reverse reads.

out

File for merged reads.

outu1

File for forward unmerged reads.

outu2

File for reverse unmerged reads.

minoverlap

Minimum number of overlapping bases to allow merging.

ihist

Insert length histogram output file.

usejni

Do overlapping in C code, which is faster. Requires compiling the C code.