
# Quality Check (Adapter Trimming & Quality Filter) of NGS Data Sets

Another frequently used software is cutadapt. Modules are available on both clusters as:

You can find a wrapper to ease your workflow, below.

As cutadapt is rather slow, it is not supported by the wrapper module on MOGON II.

flexbar has been shown to be a versatile and fast NGS preprocessing application.

It is available on MOGON as the module bio/flexbar and you can find its manual on the web.

Trimmomatic, being a Java application, requires some preparation to run reliably. It is provided as a module:

bio/Trimmomatic

$ module load bio/Trimmomatic/0.36-Java-1.8.0_162
To execute Trimmomatic run: java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar

This is correct, but not sufficient: in prolonged runs Trimmomatic (or rather the Java runtime) tends to allocate more memory than is allowed for a single job step.

Therefore, we suggest defining:

export MALLOC_ARENA_MAX=4
# calculate the memory per process:
task_cpus=$((SLURM_NTASKS * SLURM_CPUS_PER_TASK))
mem_per_process=$((SLURM_MEM_PER_NODE / task_cpus))
# and start Trimmomatic like:
srun java -Xmx${mem_per_process}M -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar ...
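The calculation above relies on SLURM variables that may be unset, depending on how the job was submitted (for instance, `SLURM_CPUS_PER_TASK` is only set when `--cpus-per-task` was given). The following is a minimal sketch with fallbacks. The function name `mem_per_process` and the fallback value of 1 are assumptions made for illustration; they are not part of the module.

```shell
# Print the memory (in MB) per CPU of the job, as a basis for java -Xmx.
# Assumption: SLURM_NTASKS and SLURM_CPUS_PER_TASK fall back to 1 if unset.
# SLURM_MEM_PER_NODE is only available if memory was requested with --mem.
mem_per_process() {
    local ntasks cpus
    ntasks=${SLURM_NTASKS:-1}
    cpus=${SLURM_CPUS_PER_TASK:-1}
    if [ -z "${SLURM_MEM_PER_NODE:-}" ]; then
        echo "SLURM_MEM_PER_NODE is not set; request memory with --mem" >&2
        return 1
    fi
    # integer division, identical to the two-step calculation above
    echo $(( SLURM_MEM_PER_NODE / (ntasks * cpus) ))
}
```

Within a job script, the result could then be used as `srun java -Xmx$(mem_per_process)M -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar ...`.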

Also, an adaptor can be supplied as:

ILLUMINACLIP:$EBROOTTRIMMOMATIC/adaptors/TruSeq3-PE.fa:2:30:10

The numbers are included merely for demonstration purposes. The important part is the full path to the adaptor file, which would otherwise be searched for in the local directory: $EBROOTTRIMMOMATIC/adaptors/TruSeq3-PE.fa
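As a small sketch of how such a step can be assembled, the helper below builds the ILLUMINACLIP string from the module's installation path. The function name `illuminaclip_step` is hypothetical. The three numeric fields are, in the order given in the Trimmomatic manual, the seed mismatches, the palindrome clip threshold and the simple clip threshold.

```shell
# Build an ILLUMINACLIP step that points to an adaptor file shipped with
# the loaded Trimmomatic module (directory name as used on this page).
# Arguments: adaptor file name, seed mismatches,
#            palindrome clip threshold, simple clip threshold
illuminaclip_step() {
    local adaptor_dir
    # aborts with a message if the module has not been loaded
    adaptor_dir="${EBROOTTRIMMOMATIC:?load the Trimmomatic module first}/adaptors"
    printf 'ILLUMINACLIP:%s/%s:%s:%s:%s\n' "$adaptor_dir" "$1" "$2" "$3" "$4"
}
```

Calling `illuminaclip_step TruSeq3-PE.fa 2 30 10` after loading the module yields the step shown above with the path expanded.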

To scale the task from trimming one (or a few) samples to trimming several in parallel, we provide a wrapper script, which is available as a module:

bio/parallel_QCTools

The code is under version management and hosted internally, here.

The wrapper script submits a job itself: it is not intended to be run within a SLURM environment, but rather creates one.

Calling QCWrapper -h displays a help message with all the options the script provides. Likewise, calling QCWrapper --credits displays credits and a version history.

#### Limitations

• The wrapper recognizes FASTQ files with the suffixes “*.gz”, “*.fastq” or “*.fq” and will always assume FASTQ files.
• The number of processes (and therefore nodes) is limited to the number of samples.
• The wrapper only works for paired-end sequencing data, where the file tuples are designated with the strings “_1” and “_2” or “_R1” and “_R2”, respectively.

• readdir can be a relative path to the top-level directory containing the FASTQ tuples.
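To illustrate the naming convention from the limitations above, the following sketch derives the name of the second mate from the first. It is not the wrapper's actual code: the function name `mate_of` is hypothetical, and it assumes that the designator sits directly in front of the file suffix.

```shell
# Print the mate file name for a first-mate FASTQ file name.
# Handles the designators "_1"/"_2" and "_R1"/"_R2" in front of the
# suffixes .fastq, .fq or .gz (including .fastq.gz and .fq.gz).
# Returns 1 if the name does not follow the convention.
mate_of() {
    local f ext
    f=$1
    # longer suffixes first, so that ".fastq.gz" wins over ".gz"
    for ext in .fastq.gz .fq.gz .fastq .fq .gz; do
        case "$f" in
            *_R1"$ext") printf '%s\n' "${f%_R1$ext}_R2$ext"; return 0 ;;
            *_1"$ext")  printf '%s\n' "${f%_1$ext}_2$ext";   return 0 ;;
        esac
    done
    return 1
}
```

For example, `mate_of sample_R1.fastq.gz` prints `sample_R2.fastq.gz`, while a name without a designator makes the function return a non-zero status.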

The options:

• --executable, mandatory argument to designate the executable
  1. possible arguments: cutadapt, flexbar, trimmomatic
  2. check is case insensitive
  3. defaults to 'flexbar'
• -l,--runlimit, the run time limit, defaults to 300 minutes
• -p,--partition, the default is nodeshort, or parallel on MOGON II
• -A,--account, SLURM account
  1. default is the last submit account
  2. an error is triggered if none is specified and none can be deduced
• -t,--threads, number of threads the executable should use (defaults are application dependent)
• -a,--args, arguments otherwise not set by the wrapper
  1. the defaults of the chosen executable apply for unset arguments
  2. will supersede the defaults, e.g. 'LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36' for trimmomatic
• -d,--dependency, comma-separated list of job IDs which the job will wait for to finish
• -o,--outdir, output directory path (default is the current working directory)
• -a,--adapter, a selection of one of Trimmomatic's pre-defined adapters, defaults to 'TruSeq3-PE.fa'
  1. a selection of one of Trimmomatic's pre-defined adapters (defaults to TruSeq3-PE.fa), or else
  2. an adaptor string specification according to the selected software
  3. defaults to 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' (adaptor 1 of TruSeq3-PE.fa)
  1. if the executable is trimmomatic, this argument is not necessary (it is contained in the global adaptor selection)
  2. if the executable is cutadapt, this argument defaults to 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT' (the mate of TruSeq3-PE.fa)
  3. if the executable is flexbar, this argument is not necessary (it is contained in the global adaptor selection)
• --single, if given, single-end data will be assumed; otherwise paired-end data are the default
• flexbar specific
  • --qtrim, see the '--qtrim' option of flexbar, defaults to 'WIN'
  • --qtrim-format, see the -qf/--qtrim-format option of flexbar, default is 'i1.5'
• --constraint, only on MOGON II, defaults to 'broadwell'
• --tag, a job tag (default is deduced by the naming scheme)
• --credits, shows credits and version history
• --version, shows the version number
• -h,--help, prints help
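The options above might be combined as in the following sketch, which only prints a candidate command line instead of submitting anything. All values are placeholders and the function name `qcwrapper_example` is hypothetical. Placing the read directory last is an assumption, so please check the output of `QCWrapper -h` for the actual calling convention.

```shell
# Print (do not run) a QCWrapper call for trimmomatic with a 120 minute
# run limit and 4 threads. Only options listed on this page are used.
# Arguments: SLURM account, read directory, output directory
qcwrapper_example() {
    local account readdir outdir
    account=$1   # SLURM account (placeholder)
    readdir=$2   # top-level directory containing the FASTQ tuples
    outdir=$3    # output directory
    # ASSUMPTION: the read directory is passed as the last argument
    printf 'QCWrapper --executable trimmomatic -A %s -l 120 -t 4 -o %s %s\n' \
        "$account" "$outdir" "$readdir"
}
```

Printing the command first makes it easy to inspect the arguments before the wrapper creates the actual job.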

The output naming scheme:

Within the specified (or default) output directory, you will find your sample subdirectories again (if any were present). The prefix of each sample is preserved. As the wrapper allows only certain designators to distinguish the mate pairs (see the limitations above), these are also preserved. Trimmomatic splits its output into paired and unpaired reads (if any). As of version 0.2, the latter are written to a subdirectory named unpaired.

#### Note

The figure below is not the final evaluation; this is forthcoming. I expect the assumption that flexbar outperforms trimmomatic to hold.

When selecting the executable, consider at least the following three criteria:

1. Speed as indicated in the left figure
2. The memory footprint, which is negligible except for trimmomatic, where on MOGON II a memory reservation higher than the default had to be implemented.
3. Quality. Here, flexbar is the most feature rich, but also the most complex.
