Quality Check (Adapter Trimming & Quality Filter) of NGS Data Sets

Software Options

Cutadapt

cutadapt is a frequently used tool for adapter trimming. Modules are available on both clusters as:

bio/cutadapt

You can find a wrapper to ease your workflow below.

As cutadapt is rather slow, it is not supported by the wrapper module on Mogon II.

flexbar

flexbar has been shown to be a versatile and fast NGS preprocessing application.

It is available on Mogon as the module bio/flexbar, and you can find its manual on the web.

You can find a wrapper to ease your workflow below.

Trimmomatic

Trimmomatic, being a Java application, requires some preparation to run reliably. It is provided as a module 1):

bio/Trimmomatic

When loading a Trimmomatic module a message is printed:

$ module load bio/Trimmomatic/0.36-Java-1.8.0_162
To execute Trimmomatic run: java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.ja

This is correct, but not sufficient: in prolonged runs Trimmomatic (or rather the Java virtual machine) tends to allocate more memory than is allowed in a single job step.

Therefore, we suggest defining:

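export MALLOC_ARENA_MAX=4
# calculate the memory per process (SLURM_MEM_PER_NODE is given in MB);
# SLURM_CPUS_PER_TASK is only set if --cpus-per-task was requested, hence the fallback to 1:
task_cpus=$((SLURM_NTASKS * ${SLURM_CPUS_PER_TASK:-1}))
mem_per_process=$((SLURM_MEM_PER_NODE / task_cpus))
# and start Trimmomatic with the maximum Java heap size limited accordingly:
srun java -Xmx${mem_per_process}M -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar ...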
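The arithmetic above can be checked outside of a job with a small sketch. The function name heap_per_process_mb and the fallback values (64000 MB per node, 4 tasks, 4 CPUs per task) are assumptions made for illustration only; they are not part of the module.

```shell
#!/bin/sh
# Hypothetical helper (not part of the module): Java heap size in MB per process.
# $1: memory per node in MB, $2: number of tasks, $3: CPUs per task (default 1)
heap_per_process_mb() {
    task_cpus=$(( $2 * ${3:-1} ))
    if [ "$task_cpus" -le 0 ]; then
        echo "error: task count and CPUs per task must be positive" >&2
        return 1
    fi
    # integer division, the result is rounded down
    echo $(( $1 / $task_cpus ))
}

# Inside a job the SLURM variables are used; the fallbacks are assumed
# example values, so that the sketch also runs without SLURM.
heap_mb=$(heap_per_process_mb "${SLURM_MEM_PER_NODE:-64000}" \
                              "${SLURM_NTASKS:-4}" \
                              "${SLURM_CPUS_PER_TASK:-4}")
echo "java -Xmx${heap_mb}M -jar ..."
```

Outside of a job, with the assumed fallback values, this yields a heap limit of 4000 MB per process (64000 / (4 * 4)).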

Also, an adapter file can be supplied as:

ILLUMINACLIP:$EBROOTTRIMMOMATIC/adaptors/TruSeq3-PE.fa:2:30:10

The numbers are merely included for demonstration purposes; the important part is the full path to the adapter file, which would otherwise be searched for in the current working directory: $EBROOTTRIMMOMATIC/adaptors/TruSeq3-PE.fa 2)
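To show where the ILLUMINACLIP step sits in a complete call, here is a minimal sketch that assembles a paired-end Trimmomatic command line and prints it instead of executing it. The sample and output file names are hypothetical, the placeholder root path is only used when no Trimmomatic module is loaded, and the remaining trimming steps are example values rather than a recommendation.

```shell
#!/bin/sh
# Assumption: falls back to a placeholder path if no Trimmomatic module is loaded.
trimmomatic_root="${EBROOTTRIMMOMATIC:-/path/to/Trimmomatic}"
# adapter file location as given in the text above
adapters="$trimmomatic_root/adaptors/TruSeq3-PE.fa"

# ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
clip_step="ILLUMINACLIP:$adapters:2:30:10"

# Paired-end mode takes two inputs and four outputs
# (paired and unpaired reads for each mate); all file names are hypothetical.
cmd="java -jar $trimmomatic_root/trimmomatic-0.36.jar PE \
sample_R1.fastq.gz sample_R2.fastq.gz \
out_R1.paired.fastq.gz out_R1.unpaired.fastq.gz \
out_R2.paired.fastq.gz out_R2.unpaired.fastq.gz \
$clip_step LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"

# Print instead of executing, so the call can be inspected first.
echo "$cmd"
```

In a real job the printed command would be started via srun, together with the -Xmx setting described above.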

You can find a wrapper to ease your workflow below.

The Wrapper Module on Mogon

To scale the task from trimming one (or a few) samples to trimming many samples in parallel, we provide a wrapper script, which is available as a module:

bio/parallel_QCTools

The code is under version control and hosted internally.

The wrapper script submits a job itself: it is not intended to be run within an existing SLURM job, but rather creates one.

Calling QCWrapper -h will display a help message with all the options the script provides. Likewise, the call QCWrapper --credits will display credits and a version history.

After loading the module, the script can be run like:

$ QCWrapper [options] <readdir>

Limitations:

  • The wrapper recognizes files with the suffixes “*.gz”, “*.fastq” or “*.fq” and will always assume them to be FASTQ files.
  • The number of processes (and therefore nodes) is limited to the number of samples.
  • The wrapper only works for paired-end sequencing data, where the file tuples are designated with the strings “_1” and “_2” or “_R1” and “_R2”, respectively.
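The naming convention for the file tuples can be illustrated with a short sketch that derives the name of the mate file from a forward read file. This is not the wrapper's actual code; the function mate_of and the sample file names are hypothetical, and the sketch assumes that the designator directly precedes the file suffix.

```shell
#!/bin/sh
# Hypothetical helper: print the mate ("_2" / "_R2") of a forward read file.
mate_of() {
    file=$1
    # longer suffixes are tried first, so ".fastq.gz" wins over ".gz"
    for ext in .fastq.gz .fq.gz .fastq .fq .gz; do
        case "$file" in
            *_R1"$ext") echo "${file%_R1$ext}_R2$ext"; return 0 ;;
            *_1"$ext")  echo "${file%_1$ext}_2$ext";  return 0 ;;
        esac
    done
    echo "no mate designator found in: $file" >&2
    return 1
}

# Example file names (assumed):
for forward in sample_A_R1.fastq.gz sample_B_1.fq; do
    echo "$forward -> $(mate_of "$forward")"
done
```

A check of this kind before submitting can reveal files that do not follow the naming scheme.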

About Arguments:

  • readdir can be a relative path to the top-level directory containing the FASTQ tuples.

The options:

  • --executable, mandatory argument to designate the executable
    1. possible arguments: cutadapt, flexbar, trimmomatic
    2. the check is case-insensitive
    3. defaults to 'flexbar'
  • -l,--runlimit, run time limit; defaults to 300 minutes
  • -p,--partition, the default is nodeshort, or parallel on Mogon2
  • -A,--account, SLURM account
    1. default is the last submit account
    2. an error is triggered if none is specified and none can be deduced
  • -t,--threads, number of threads the executable should use (defaults are application dependent)
  • -a,--args, arguments otherwise not set by the wrapper
    1. the defaults of the chosen executable apply for unset arguments
    2. will supersede the defaults, e.g. 'LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36' for trimmomatic
  • -d,--dependency, comma-separated list of job IDs the job will wait for to finish
  • -o,--outdir, output directory path (default is the current working directory)
  • -a,--adapter, the adapter to trim, given as either
    1. a selection of one of Trimmomatic's pre-defined adapters (defaults to 'TruSeq3-PE.fa'), or else
    2. an adapter string specification according to the selected software
    3. the string specification defaults to 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' (adapter 1 of TruSeq3-PE.fa)
  • --adapterp, adapter for the mate pair
    1. if the executable is trimmomatic, this argument is not necessary (it is contained in the global adapter selection)
    2. if the executable is cutadapt, this argument defaults to 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT' (the mate of TruSeq3-PE.fa)
    3. if the executable is flexbar, this argument is not necessary (it is contained in the global adapter selection)
  • --single, if given, single-end data will be assumed; otherwise paired-end data are the default
  • flexbar specific
    • --qtrim, see the '--qtrim' option of flexbar, defaults to 'WIN'
    • --qtrim-format, see the -qf/--qtrim-format option of flexbar, default is 'i1.5'
  • --constraint, only on Mogon2, defaults to 'broadwell'
  • --tag, a job tag (default is deduced by the naming scheme)
  • --credits, shows credits and version history
  • --version, shows the version number
  • -h,--help, prints help
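As an illustration of how these options combine, the following sketch assembles a wrapper call and prints it. The directory names, the account name and the chosen values are hypothetical; only the option names are taken from the list above.

```shell
#!/bin/sh
# All values below are assumed placeholders for illustration.
readdir="raw_reads"     # top-level directory containing the FASTQ tuples
outdir="trimmed"        # output directory
account="my_account"    # SLURM account

# trimmomatic as executable, 120 minutes run limit, 4 threads per process
cmd="QCWrapper --executable trimmomatic -l 120 -A $account -t 4 -o $outdir $readdir"

# Print the call; on the cluster it would be run after
#   module load bio/parallel_QCTools
echo "$cmd"
```

Printing the call first makes it easy to review the options before the wrapper submits a job.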

The output naming scheme:

Within the specified (or default) output directory, you will find your sample subdirectories again (if any were present). The prefix of each sample is preserved. As the wrapper allows only certain designators to distinguish the mate pairs (see the limitations above), these are also preserved. Trimmomatic splits its output into reads which are paired and unpaired (if any). As of version 0.2, the latter are written to a subdirectory unpaired.

1) Loading a module without a version specification will load the most recent one.
2) Which adapter file you pick is, of course, project dependent.
software/topical/lifescience/qc.1548681361.txt.gz · Last modified: 2019/01/28 14:16 by meesters