Quality Check (Adapter Trimming & Quality Filter) of NGS Data Sets
Software Options
Cutadapt
cutadapt is a frequently used tool for adapter trimming. Modules are available on both clusters as:
bio/cutadapt
A wrapper to ease your workflow is described below.
flexbar
flexbar has been shown to be a versatile and fast NGS preprocessing application.
It is available on MOGON as the module bio/flexbar, and its manual can be found on the web.
Trimmomatic
Trimmomatic, being a Java application, requires some preparation to run reliably. It is provided as a module:
bio/Trimmomatic
When loading a Trimmomatic module a message is printed:
$ module load bio/Trimmomatic/0.36-Java-1.8.0_162
To execute Trimmomatic run: java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.ja
This is correct, but not sufficient: in prolonged runs Trimmomatic (or rather Java) tends to allocate more memory than a single job step is allowed to use.
Therefore, we suggest defining:
export MALLOC_ARENA_MAX=4
# calculate the memory per process:
task_cpus=$((SLURM_NTASKS * SLURM_CPUS_PER_TASK))
mem_per_process=$((SLURM_MEM_PER_NODE / task_cpus))
# and start Trimmomatic like:
srun java -Xmx${mem_per_process}M -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar ...
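As a worked example of this calculation, the following sketch wraps the arithmetic in a small function. The function name mem_per_process_mb and the sample numbers are hypothetical. The sketch assumes that SLURM_MEM_PER_NODE is given in megabytes, and it falls back to 1 for SLURM_CPUS_PER_TASK, which SLURM only sets when --cpus-per-task was requested.

```shell
#!/bin/bash
# Sketch: Java heap size per process, in MB.
# Arguments: <ntasks> <cpus_per_task> <mem_per_node_mb>
mem_per_process_mb() {
    local ntasks=$1 cpus_per_task=$2 mem_per_node=$3
    local task_cpus=$((ntasks * cpus_per_task))
    echo $((mem_per_node / task_cpus))
}

# Hypothetical job: 4 tasks with 2 CPUs each and 64000 MB per node.
mem_per_process_mb 4 2 64000    # prints 8000

# Inside a job script the SLURM variables are passed in; the ':-' fallbacks
# are assumptions that only keep the sketch runnable outside of a job.
heap=$(mem_per_process_mb "${SLURM_NTASKS:-1}" \
                          "${SLURM_CPUS_PER_TASK:-1}" \
                          "${SLURM_MEM_PER_NODE:-1024}")
echo "java -Xmx${heap}M -jar \$EBROOTTRIMMOMATIC/trimmomatic-0.36.jar ..."
```

The division is an integer division, so the result is rounded down and the heap limit stays on the safe side of the reservation.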
Also, an adapter file can be supplied as:
ILLUMINACLIP:$EBROOTTRIMMOMATIC/adaptors/TruSeq3-PE.fa:2:30:10
The figures are merely included for demonstration purposes; the important part is the full path to the adapter file, which would otherwise be searched for in the current working directory: $EBROOTTRIMMOMATIC/adaptors/TruSeq3-PE.fa
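To make the role of the three figures explicit, the sketch below assembles the ILLUMINACLIP step from named parts. According to the Trimmomatic manual, the fields after the adapter file are the seed mismatches, the palindrome clip threshold and the simple clip threshold. The function name illuminaclip_step is hypothetical, and the fallback value for EBROOTTRIMMOMATIC is a placeholder that only keeps the sketch runnable without the module loaded.

```shell
#!/bin/bash
# Placeholder fallback (assumption); the real value is set by loading
# a bio/Trimmomatic module.
EBROOTTRIMMOMATIC=${EBROOTTRIMMOMATIC:-/path/to/trimmomatic}

# Build an ILLUMINACLIP step of the form
#   ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
illuminaclip_step() {
    local adapter_file=$1 seed_mismatches=$2
    local palindrome_threshold=$3 simple_threshold=$4
    echo "ILLUMINACLIP:${adapter_file}:${seed_mismatches}:${palindrome_threshold}:${simple_threshold}"
}

# Pass the full path to the adapter file shipped with the module, so that
# Trimmomatic does not look for it in the current working directory.
illuminaclip_step "$EBROOTTRIMMOMATIC/adaptors/TruSeq3-PE.fa" 2 30 10
```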
The Wrapper Module on Mogon
To scale the task from one (or a few) samples to be trimmed to many samples trimmed in parallel, we provide a wrapper script, which is available as a module:
bio/parallel_QCTools
The code is under version control and hosted internally.
Calling QCWrapper -h will display a help message with all the options the script provides. Likewise, calling QCWrapper --credits will display credits and a version history.
After loading the module, the script can be run like:
$ QCWrapper [options] <readdir>
Limitations
- The wrapper recognizes FASTQ files with the suffixes “*.gz”, “*.fastq” or “*.fq” and will always assume FASTQ files.
- The number of processes (and therefore nodes) is limited to the number of samples.
- The wrapper only works for paired-end sequencing data, where the file tuples are designated with the strings “_1” and “_2” or “_R1” and “_R2”, respectively.
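The naming convention from the limitations above can be illustrated with a short sketch that derives the name of the mate file from a forward-read file. This is not the wrapper's actual code: the function name mate_of is hypothetical, and the sketch assumes that the designator sits directly in front of one of the recognized suffixes.

```shell
#!/bin/bash
# Sketch: print the mate file name for a forward-read FASTQ file.
# Returns 1 if no supported designator/suffix combination is found.
mate_of() {
    local forward=$1 suffix
    # Longer suffixes first, so that '.fastq.gz' wins over '.gz'.
    for suffix in .fastq.gz .fq.gz .fastq .fq .gz; do
        case "$forward" in
            *_R1$suffix) echo "${forward%_R1$suffix}_R2$suffix"; return 0 ;;
            *_1$suffix)  echo "${forward%_1$suffix}_2$suffix";  return 0 ;;
        esac
    done
    return 1
}

mate_of "sampleA_R1.fastq.gz"    # prints sampleA_R2.fastq.gz
mate_of "run_10_1.fq"            # prints run_10_2.fq
mate_of "sample.fastq" || echo "no mate designator found"
```

Stripping the designator from the end of the name, rather than replacing its first occurrence, keeps sample names such as run_10 intact.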
About Arguments:
readdir is the path (it may be a relative path) to the top-level directory containing the FASTQ tuples.
The options:
--executable, mandatory argument to designate the executable
  - possible arguments: cutadapt, flexbar, trimmomatic
  - check is case insensitive
  - defaults to 'flexbar'
-l, --runlimit, defaults to 300 minutes
-p, --partition, the default is nodeshort or parallel on MOGON II
-A, --account, SLURM account
  - default is the last submit account
  - an error is triggered if none is specified and none can be deduced
-t, --threads, number of threads the executable should use (defaults are application dependent)
-a, --args, arguments otherwise not set by the wrapper
  - the defaults of the chosen executable apply for unset arguments
  - will supersede the defaults, e.g. 'LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36' for trimmomatic
-d, --dependency, comma-separated list of job IDs the job will wait for to finish
-o, --outdir, output directory path (default is the current working directory)
-a, --adapter, adapter selection
  - a selection of one of Trimmomatic's pre-defined adapters, defaults to 'TruSeq3-PE.fa'
  - else, an adapter string specification according to the selected software, defaults to 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' (adapter 1 of TruSeq3-PE.fa)
--adapterp, adapter for the mate pair
  - if the executable is trimmomatic, this argument is not necessary (it is contained in the global adapter selection)
  - if the executable is cutadapt, this argument defaults to 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT' (the mate of TruSeq3-PE.fa)
  - if the executable is flexbar, this argument is not necessary (it is contained in the global adapter selection)
--single, if given, single-end data will be assumed; otherwise paired-end data are the default
flexbar specific:
  --qtrim, see the '--qtrim' option of flexbar, defaults to 'WIN'
  --qtrim-format, see the -qf/--qtrim-format option of flexbar, default is 'i1.5'
--constraint, only on MOGON II, defaults to 'broadwell'
--tag, a job tag (default is deduced by the naming scheme)
--credits, shows credits and version history
--version, shows the version number
-h, --help, prints help
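Putting some of these options together, a call could look like the dry-run sketch below. It only prints the command line instead of submitting anything, so it can be tried without the module loaded. The function name qcwrapper_cmd, the directory names and the chosen values are hypothetical. The sketch also assumes that the long options are spelled with two hyphens and take their value as the following argument; check QCWrapper -h for the authoritative syntax.

```shell
#!/bin/bash
# Dry run: print a QCWrapper call for the given executable, output
# directory and read directory. Nothing is executed or submitted.
qcwrapper_cmd() {
    local executable=$1 outdir=$2 readdir=$3
    local runlimit=120   # minutes (hypothetical value; the default is 300)
    local threads=8      # hypothetical value
    echo "QCWrapper --executable $executable --runlimit $runlimit --threads $threads --outdir $outdir $readdir"
}

# Hypothetical layout: 'raw_reads' holds the FASTQ tuples,
# trimmed output goes to 'trimmed'.
qcwrapper_cmd trimmomatic trimmed raw_reads
```

The long option names are used here because the short form -a is listed for both the --args and the --adapter option.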
The output naming scheme:
Within the specified (or default) output directory, you will find your sample subdirectories again (if any were present). The prefix of each sample is preserved. As the wrapper allows only certain designators to distinguish the mate pairs (see the limitations above), these are also preserved. Trimmomatic splits its output into paired and unpaired reads (if any). As of version 0.2, the latter are written to a subdirectory named unpaired.
Selecting the Executable
Note
The figure below is not the final evaluation; that is forthcoming. I expect the assumption that flexbar outperforms trimmomatic to hold.
At a minimum, the following three criteria should be considered when selecting the executable:
- Speed, as indicated in the left figure.
- The memory footprint, which is negligible except for trimmomatic: on MOGON II it required a higher memory reservation than the default.
- Quality. Here, flexbar is the most feature-rich, but also the most complex.