


Quality Check (Adapter Trimming & Quality Filter) of NGS Data Sets

cutadapt is a frequently used tool for adapter trimming. Modules are available on both clusters as:

bio/cutadapt

flexbar has been shown to be a versatile and fast NGS preprocessing application.

It is available on Mogon as the module bio/flexbar and you can find its manual on the web.

Trimmomatic, being a Java application, requires some preparation to run reliably. It is provided as modules 1):

bio/Trimmomatic

When loading a Trimmomatic module a message is printed:

$ module load bio/Trimmomatic/0.36-Java-1.8.0_162
To execute Trimmomatic run: java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.ja

This is correct, but not sufficient: in prolonged runs, Trimmomatic (or rather the Java runtime) tends to allocate more memory than a single job step is allowed to use.

Therefore, we suggest defining:

export MALLOC_ARENA_MAX=4
# calculate the memory per process:
task_cpus=$((SLURM_NTASKS * SLURM_CPUS_PER_TASK))
mem_per_process=$((SLURM_MEM_PER_NODE / task_cpus))
# and start Trimmomatic like:
srun java -Xmx${mem_per_process}M -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar ...
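The calculation above can be wrapped in a small function with fallbacks, since SLURM may leave some of these variables unset (for example, SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested). The following is a minimal sketch, not part of the module: the function name heap_mb_per_process and the 4096 MB fallback are assumptions chosen for illustration.

```bash
# Hypothetical helper (not provided by the module): heap size in MB per Java process.
# $1: memory per node in MB, $2: number of tasks, $3: CPUs per task.
# A missing $2 or $3 is treated as 1.
heap_mb_per_process() {
    echo $(( $1 / ( ${2:-1} * ${3:-1} ) ))
}

# Inside a job script: fall back to 1 where SLURM did not set a variable.
# The 4096 MB fallback is an arbitrary value for illustration only.
heap=$(heap_mb_per_process "${SLURM_MEM_PER_NODE:-4096}" \
                           "${SLURM_NTASKS:-1}" \
                           "${SLURM_CPUS_PER_TASK:-1}")
echo "Java heap option: -Xmx${heap}M"
```

The resulting value would then be passed to java as shown above, for example with -Xmx${heap}M.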

Also, an adaptor file can be supplied as:

ILLUMINACLIP:$EBROOTTRIMMOMATIC/adaptors/TruSeq3-PE.fa:2:30:10

The numbers are included merely for demonstration purposes. The important part is the full path to the adaptor file, which would otherwise be searched for in the current working directory: $EBROOTTRIMMOMATIC/adaptors/TruSeq3-PE.fa 2)
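To show where the ILLUMINACLIP step sits in a complete paired-end call, the sketch below prints (but does not run) a Trimmomatic command line for one sample. It assumes the usual Trimmomatic PE argument order (two inputs, then paired/unpaired outputs for each mate, then the trimming steps). The function name, the input suffix _1.fastq.gz/_2.fastq.gz and the output file names are hypothetical; the trimming steps are the ones this page lists as the wrapper defaults.

```bash
# Hypothetical helper: print a paired-end Trimmomatic call for one sample.
# $1 is the sample prefix; inputs are assumed to be <prefix>_1.fastq.gz
# and <prefix>_2.fastq.gz. Output names are illustrative only.
trimmomatic_pe_cmd() {
    # EBROOTTRIMMOMATIC is set by the module; the fallback is a placeholder.
    echo java -jar "${EBROOTTRIMMOMATIC:-/path/to/Trimmomatic}/trimmomatic-0.36.jar" PE \
        "${1}_1.fastq.gz" "${1}_2.fastq.gz" \
        "${1}_1_paired.fastq.gz" "${1}_1_unpaired.fastq.gz" \
        "${1}_2_paired.fastq.gz" "${1}_2_unpaired.fastq.gz" \
        "ILLUMINACLIP:${EBROOTTRIMMOMATIC:-/path/to/Trimmomatic}/adaptors/TruSeq3-PE.fa:2:30:10" \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
}

# Preview the command for a sample called "sampleA".
trimmomatic_pe_cmd sampleA
```

Printing the command first makes it easy to check that the adaptor path resolves to the module directory before starting a long run.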

To scale the task from one (or a few) samples to many samples trimmed in parallel, we provide a wrapper script, which is available as a module:

bio/parallel_Trimmomatic

The code is under version control and hosted internally.

The wrapper script submits a job itself: it is not intended to be run from within a SLURM environment, but rather creates one.

Calling parallel_Trimmomatic -h will display a help message with all the options the script provides. Likewise, the call parallel_Trimmomatic will display credits and a version history.

After loading the module, the script can be run like this:

$ parallel_Trimmomatic [options] <readdir>

Limitations:

  • The wrapper recognizes files with the suffixes “*.gz”, “*.fastq” or “*.fq” and will always assume that they are FASTQ files.
  • The number of processes (and therefore nodes) is limited to the number of samples.
  • The wrapper only works for paired-end sequencing data, where the file tuples are designated with the strings “_1” and “_2” or “_R1” and “_R2”, respectively.
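The pairing convention in the last limitation can be illustrated with a short sketch. This is not the wrapper's actual code; the function name mate_of is hypothetical, and it assumes that the designator is directly followed by the file suffix (for example sample_R1.fastq.gz).

```bash
# Hypothetical sketch (not the wrapper's implementation):
# print the expected reverse-read file name for a forward-read file name.
# Returns non-zero if the name carries neither "_R1" nor "_1" before the suffix.
mate_of() {
    case "$1" in
        *_R1.*) echo "${1%_R1.*}_R2.${1##*_R1.}" ;;
        *_1.*)  echo "${1%_1.*}_2.${1##*_1.}" ;;
        *)      return 1 ;;
    esac
}

# List the tuples for a few example names; names that are not forward reads are skipped.
for f in sampleA_R1.fastq.gz sampleB_1.fq sampleB_2.fq; do
    if mate=$(mate_of "$f"); then
        echo "$f <-> $mate"
    fi
done
```

Files whose names do not follow one of the two designator schemes yield no mate in this sketch, which mirrors the limitation stated above.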

About Arguments:

  • readdir can be a relative path to the top-level directory containing the FASTQ tuples.

The options:

  • parallel_Trimmomatic attempts to deduce your SLURM account. This may fail, in which case -A, --account needs to be supplied.
  • -N, --nodes: allows reserving more than 1 node (the default is 1). This may speed up the trimming; see the limitations above.
  • -d, --dependency: a comma-separated list of job IDs that the job will wait for to finish.
  • -l, --runlimit: the run time limit; defaults to 300 minutes.
  • -p, --partition: the default is nodeshort, or parallel on Mogon2; no smp partition should be chosen.
  • -t, --threads: Trimmomatic can work in parallel. Please consult the manual. The default is 2.
  • -o, --outdir: output directory path (default is the current working directory).
  • -a, --adapter: a selection of one of Trimmomatic's pre-defined adapters; defaults to 'TruSeq3-PE.fa'.
  • --options: a string of Trimmomatic options that supersedes the defaults: 'LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36'.
  • --constraint: only on Mogon2; defaults to 'anyarch'.
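A call combining several of these options might look like the sketch below. It assumes the long option names listed above, written with two hyphens. The account name, the directory names and the stricter trimming string are placeholders, not recommendations. The function only prints the call, so nothing is submitted.

```bash
# Placeholder values; replace them with your own account and directories.
account=my_account
readdir=raw_reads
outdir=trimmed

# Hypothetical preview function: prints the wrapper call instead of running it.
# Note that echo drops the quotes around the --options string; when running the
# call for real, keep the string quoted so it is passed as a single argument.
preview_call() {
    echo parallel_Trimmomatic --account "$account" --nodes 2 --threads 4 \
        --outdir "$outdir" \
        --options 'LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50' \
        "$readdir"
}

preview_call
```

To submit for real, the same arguments would be passed to parallel_Trimmomatic directly after loading the module. Requesting more nodes than there are samples brings no benefit, as noted in the limitations.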

The output naming scheme:

Within the specified (or default) output directory, you will find your sample subdirectories again (if any were present). The prefix of each sample is preserved. As the wrapper allows only certain designators to distinguish the mate pairs (see the limitations above), these are also preserved. Trimmomatic splits its output into reads that are paired and reads that are unpaired (if any). As of version 0.2, the latter are written to a subdirectory named unpaired.


1) Loading a module without a version specification will load the most recent one.
2) Which adaptor file you pick is, of course, project dependent.
  • Last modified: 2019/01/28 08:40
  • by meesters