This is an old revision of the document!
NGS Read Mapping Software on Mogon
As a first introduction into NGS alignment software tools we recommend reading this short blog post. Or in other words: It might be, that the list of supported tools grows and grows, due to your requests, but will never really cover everybody's favorite tool - there are just too many and some are just not worth having.
Software Options
BWA
BWA is one mapping tool, particularly to map “low-divergent sequences against a large reference genome”. Modules on Mogon can be found as1):
bio/BWA/<version>
You can find a wrapper to ease your workflow, below.
BarraCuda
Barracuda is a GPU-accelerated implementation of BWA and can be found on Mogon as the module
bio/barracuda
It does not support bwa mem …
but rather leverages bwa aln …
to GPUs.
See below for a wrapper script to ease your workflow.
RazerS 3
RazerS 3 as yara is part of the seqan modules:
bio/SeqAn/<version>
You can find a wrapper to ease your workflow, below.
Bowtie2
Bowtie2 is a well known read aligner with a focus on gapped alignments.
Module(s) can be found at:
bio/Bowtie2/<version>
You can find a wrapper to ease your workflow, below.
STAR
STAR is a well known mapping tool for RNA-Seq data.
Module(s) can be found at:
bio/STAR/<version>
You can find a wrapper to ease your workflow, below.
segemehl
segemehl seems to be a pretty good alignment tool, mentioned here, due to the blog which is cited below.
There will be no wrapper script for segemehl
: If this comparison bears any truth, the software might be really good. But also pretty memory hungry. And several tens GB / core is just too much. If you want to try segemehl, be sure to write your own wrapper script (perhaps stage-in the reference to a local scratch, not the ramdisk) and reserve sufficient memory. Be aware that you will be accounted for the prolonged run time and memory.
The currently installed module is
bio/segemehl/0.2.0-foss-2018a
TopHat
TopHat is a fast splice junction mapper for RNA-Seq reads.
Module can be found at:
bio/TopHat/<version>
This program is not yet incorporated into the wrapping module.
yara
yara is a mapping tool with “with approximate seeds and multiple backtracking”.
It is available within the modules
bio/SeqAn/<version>
You can find a wrapper to ease your workflow, below.
Wrapper Scripts
"Standard Mappers"
Most mapping tools adhere to this paradigm: They work on a reference (directory). They are, therefore, easily wrapped, such that the reference can be staged-in to a node-local directory (e.g. a ramdisk) in order to avoid random I/O (and consequently prolonged run times) on the parallel file system.
Now, to leverage the task from 1 (or a few) samples to be mapped to several in parallel, we provide a wrapper script, which is available as a module:
bio/parallel_MappingTools
The code is under version management and hosted internally, here.
The wrapper script will submit a job, it is not intended to be just within a SLURM environment, but rather creates one.
Calling MapperWrapper -h
will display a help message with all the options, the script provides. Likewise, the call MapperWrapper –credits
will display credits and a version history.
The script, after loading the module, can then be run like:
$ MapperWrapper --executable=<executable> [options] <referencedir> <inputdir>
Considerations:
- The wrapper recognizes FASTQ files with suffixes “
*.gz
”, “*.fastq
” or “*.fq
” and will always assume FASTQ files (compressed or uncompressed). yara accepts bzipped files, too. - The number of processes (and therefore nodes) is limited to the number of samples.
- The wrapper only works for paired end sequencing data, where the file tuples are designated with the following strings “
_1
” and “_2
” or “_R1
” and “_R2
”, respectively. - There are only a few options, as internally the wrapper calls
bwa mem
(orbwa aln
in the single end case) and only sets up a few things to yield performance. Likewise a switch for single and paired end data exists for other mappers.
About Arguments:
referencedir
needs to be the (relative) path to a directory containing an indexed BWA referenceinputdir
needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the stringunpaired
are ignored; this is to support preprocessing with the |quality check module.
The options:
MapperWrapper
attempts to deduce your SLURM account. This may fail, in which case-A, –account
needs to be supplied.–verbose,–no-verbose
verbose execution (off by default)–executable
mandatory argument to designate the executable possible arguments:bwa
,bowtie2
,yara
-d,–dependency
, list of comma separated jobids, the job will wait for to finish-l,–runlimit
, this defaults to 300 minutes.-p,–partition
, the default isnodeshort
orparallel
on Mogon2, no smp-partition should be choosen.-o,–outdir
output directory path (default is the current working directory)–tag
optional tag/prefix for logfiles and directories–groups
set to provide a lists of read group tags (len(groups) must equal to No. of files)–single
(no arguments) to evaluate single end data–args
to supply additional flags, e. g.–args=“-l 1024 -n 0.02”
for BWA - note the quotation marks, they are necessary.
Output:
- Per input tuple (paired sequencing data, only) a BAM file with the prefix of the input will be written. In the case of single end data, there will be one output per input, only.
Generating Read Group Tags
Read group tags can be inserted with the –groups
flag2). The tags are supplied as a list on the command line. An example code to generate a tag list for consecutively ordered tags would be:
# defining the input directory appropriately in a master script: inputdir=/some/path/to/your/data # assuming '_R1' defines the forward reads in a paired end scenario # a template - may deviate from project to project template="@RG\tID:+ID+\tLB:unknown_lb\tPL:illumina\tSM:sample+ID+" # the tag list to be generated tags="" # number of samples - this snippet could be integrated in a script nsamples=$(find $inputdir -name '*_R1*.fastq' | grep -v unpaired | wc -l) # now the actual generation: for ((i=1; i <= $nsamples; i++)); do tags="$tags $(sed -e "s/+ID+/$i/g" <<< $template)" done
GPU-based
Whilst adhering to the same paradigm, mentioned above, barracuda
is the only read-mapping software supported, which works on GPUs3). This is different and peculiar in its setup and merits a separate module:
To leverage the task from 1 (or a few) samples to be mapped to several in parallel, we provide a wrapper script, which is available as a module:
bio/parallel_Barracuda
Calling parallel_Barracuda -h
will display a help message with all the options, the script provides. Likewise, the call parallel_Barracuda –credits
will display credits and a version history.
The script, after loading the module, can then be run like:
$ parallel_Barracuda [options] <referencedir> <inputdir>
Considerations:
- See the "standard" Mappers
- Also: The script will only use the
m2_gpu
partition and therefore needs an account with them2_
prefix4).
About Arguments:
referencedir
needs to be the (relative) path to a directory containing an indexed BWA reference. No symbolic links are allowed.inputdir
needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the stringunpaired
are ignored; this is to support preprocessing with the trimmomatic module.
The options:
parallel_BWA
attempts to deduce your SLURM account. This may fail, in which case-A, –account
needs to be supplied.-d,–dependency
, list of comma separated jobids, the job will wait for to finish-l,–runlimit
, this defaults to 300 minutes.-o,–outdir
output directory path (default is the current working directory)
Output:
- Per input tuple (paired sequencing data, only) a BAM file with the prefix of the input will be written. In the case of single end data, there will be one output per input, only.
Comparison Benchmarks
Notwithstanding, own benchmarks a first impression can be found in the same blog.
This part needs some more time to be finished ….