software:topical:lifescience:ngs_read_mapping_tools [2018/12/13 12:24] meesters |
====== NGS Read Mapping Software on Mogon ====== | |
| |
As a first introduction to NGS alignment software, we recommend reading this short [[https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software|blog post]]. In other words: the list of supported tools may keep growing [[https://hpc.uni-mainz.de/high-performance-computing/service-angebot/softwareinstallation/|due to your requests]], but it will never cover everybody's favorite tool.
| |
Notwithstanding our own [[software:topical:lifescience:ngs_read_mapping_tools#Comparison_Benchmarks|benchmarks]], a first impression can be found in [[http://www.ecseq.com/support/benchmark.html|the same blog]].
===== Software Options ===== | |
| |
==== BWA ==== | |
| |
[[http://bio-bwa.sourceforge.net/|BWA]] is a mapping tool designed particularly for mapping "low-divergent sequences against a large reference genome". Its modules on Mogon are named((loading a module without a version specification will load the most recent one)):
| |
''bio/BWA/<version>'' | |
| |
=== The Wrapper Script === | |
| |
| |
==== BarraCuda ==== | |
| |
[[http://seqbarracuda.sourceforge.net/|Barracuda]] is a GPU-accelerated implementation of [[http://bio-bwa.sourceforge.net/|BWA]] and can be found on Mogon as the module | |
| |
''bio/barracuda'' | |
| |
It does not support ''bwa mem ...''; instead, it ports ''bwa aln ...'' to GPUs.
| |
See [[:software:topical:lifescience:ngs_read_mapping_tools#gpu-based|below for a wrapper script]] to ease your workflow. | |
| |
| |
| |
==== RazerS 3 ==== | |
| |
[[https://academic.oup.com/bioinformatics/article/28/20/2592/206947|RazerS 3]], like [[:software:topical:lifescience:ngs_read_mapping_tools#yara|yara]], is part of the SeqAn modules:
| |
''bio/SeqAn/<version>'' | |
| |
A wrapper to ease your workflow will eventually be available, see [[software:topical:lifescience:#standard_mappers|below]]((not yet)).
| |
| |
==== Bowtie2 ==== | |
| |
[[https://www.nature.com/articles/nmeth.1923|Bowtie2]] is a well-known read aligner with a focus on gapped alignments.
| |
==== STAR ==== | |
| |
<WRAP center round todo 65%> | |
More info soon-ish. | |
</WRAP> | |
| |
==== segemehl ==== | |
| |
[[https://www.ncbi.nlm.nih.gov/pubmed/24626854|segemehl]] appears to be a good alignment tool. It is listed here because of the benchmark blog post cited below.
| |
<WRAP center round info 90%> | |
There will be no wrapper script for ''segemehl''. If this [[http://www.ecseq.com/support/benchmark.html|comparison]] bears any truth, the software might be really good, but it is also very memory hungry: several tens of GB per core is simply too much. If you want to try segemehl, be sure to write your own wrapper script (perhaps staging the reference in to a local scratch, not the ramdisk) and to reserve sufficient memory. Be aware that the prolonged run time and the memory will be charged to your account.
</WRAP> | |
| |
The currently installed module is | |
| |
''bio/segemehl/0.2.0-foss-2018a'' | |
| |
==== yara ==== | |
| |
[[https://academic.oup.com/nar/article/41/7/e78/1068067|yara]] is a mapping tool "with approximate seeds and multiple backtracking".
| |
It is available within the modules | |
| |
''bio/SeqAn/<version>'' | |
| |
You can find a wrapper to ease your workflow, [[software:topical:lifescience:#standard_mappers|below]]. | |
| |
===== Wrapper Scripts ===== | |
| |
==== "Standard Mappers" ==== | |
| |
Most mapping tools adhere to this paradigm: They work on a reference (directory). They are, therefore, easily wrapped, such that the reference can be staged-in to a node-local directory (e.g. a [[:ramdisk|ramdisk]]) in order to avoid random I/O (and consequently prolonged run times) on the parallel file system. | |
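
The stage-in idea can be illustrated with a minimal sketch. It is not the code of the wrapper described on this page; the function name ''stage_in'' and all paths are hypothetical, and the demo uses throw-away directories instead of a real ramdisk or local scratch.

```bash
#!/usr/bin/env bash
set -euo pipefail

# stage_in SRC_DIR LOCAL_BASE
# Copies the reference directory SRC_DIR below LOCAL_BASE and prints the
# path of the node-local copy. Name and arguments are hypothetical.
stage_in() {
    local src=${1%/} base=$2
    local dest
    dest="$base/$(basename "$src")"
    mkdir -p "$dest"
    # -L dereferences symbolic links, so the copy holds regular files only
    cp -rL "$src"/. "$dest"/
    printf '%s\n' "$dest"
}

# Demo with temporary directories. On a cluster, LOCAL_BASE would be a
# ramdisk or local scratch path provided by the job environment.
demo=$(mktemp -d)
mkdir -p "$demo/pfs/reference" "$demo/local"
echo ">chr1" > "$demo/pfs/reference/genome.fa"
local_ref=$(stage_in "$demo/pfs/reference" "$demo/local")
ls "$local_ref"
rm -rf "$demo"
```

The mapper would then be pointed at the printed path instead of the original directory, so that its random reads hit node-local storage rather than the parallel file system.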
| |
| |
To scale the task from mapping one (or a few) samples to mapping several in parallel, we provide a wrapper script, which is available as a module:
| |
''bio/parallel_MappingTools'' | |
| |
The code is under version management and hosted [[https://gitlab.rlp.net/hpc-jgu-lifescience/seq-analysis|internally, here]]. | |
| |
<WRAP center round important 90%> | |
The wrapper script submits a job itself: it is not intended to be run within a SLURM environment, but rather creates one.
</WRAP> | |
| |
Calling ''MapperWrapper -h'' will display a help message with all the options the script provides. Likewise, the call ''MapperWrapper --credits'' will display credits and a version history.

After loading the module, the script can be run as follows:
| |
<code bash> | |
$ MapperWrapper [options] <referencedir> <inputdir> | |
</code> | |
| |
<WRAP center round important 90%> | |
**Considerations**: | |
| |
  * The wrapper recognizes FASTQ files with the suffixes "''*.gz''", "''*.fastq''" or "''*.fq''" and always assumes FASTQ content (compressed or uncompressed). [[software:topical:lifescience:#yara|yara]] accepts bzipped files, too.
  * The number of processes (and therefore nodes) is limited to the number of samples.
  * By default, the wrapper expects paired-end sequencing data, where the files of a pair are designated with the strings "''_1''" and "''_2''" or "''_R1''" and "''_R2''", respectively.
  * There are only a few options, because internally the wrapper calls ''bwa mem'' (or ''bwa aln'' in the single-end case) and only sets up a few things to improve performance. Other mappers likewise have a switch for single-end and paired-end data.
</WRAP> | |
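
Before submitting, it can help to check that every first-mate file has its partner. The following is a minimal sketch and not part of the wrapper: the function names ''mate_of'' and ''check_pairs'' are hypothetical, and it assumes that the "''_1''" or "''_R1''" marker sits directly in front of the file extension.

```bash
#!/usr/bin/env bash

# mate_of FILE: print the expected name of the second-mate file, or
# return 1 if FILE does not look like a first-mate file.
# Assumption: the _1 / _R1 marker directly precedes the extension.
mate_of() {
    local re='^(.+)_(R?)1(\..+)$'
    if [[ $1 =~ $re ]]; then
        printf '%s_%s2%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    else
        return 1
    fi
}

# check_pairs DIR: report first-mate files in DIR without a second mate.
# Returns non-zero if any pair is incomplete.
check_pairs() {
    local dir=$1 f name mate missing=0
    for f in "$dir"/*.gz "$dir"/*.fastq "$dir"/*.fq; do
        [[ -e $f ]] || continue                 # unmatched glob stays literal
        name=$(basename "$f")
        [[ $name == *unpaired* ]] && continue   # such files are ignored anyway
        mate=$(mate_of "$name") || continue     # not a first-mate file
        if [[ ! -e $dir/$mate ]]; then
            echo "missing mate for $name: expected $mate"
            missing=1
        fi
    done
    return "$missing"
}

# Demo with a temporary directory: one complete pair, one lone first mate.
demo=$(mktemp -d)
touch "$demo/a_1.fq" "$demo/a_2.fq" "$demo/b_R1.fastq.gz"
check_pairs "$demo" || echo "fix the input directory before submitting"
rm -rf "$demo"
```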
| |
About Arguments: | |
| |
* ''referencedir'' needs to be the (relative) path to a directory containing an indexed BWA reference | |
* ''inputdir'' needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''unpaired'' are ignored; this is to support preprocessing with the [[software:topical:lifescience:trimmomatic|trimmomatic module]]. | |
| |
The options: | |
  * ''MapperWrapper'' attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied.
  * ''--verbose,--no-verbose'' verbose execution (off by default)
  * ''-d,--dependency'' comma-separated list of job IDs that need to finish before the job starts
  * ''-l,--runlimit'' run time limit; defaults to 300 minutes
  * ''-p,--partition'' the default is ''nodeshort'', or ''parallel'' on Mogon2; no smp partition should be chosen
  * ''-o,--outdir'' output directory path (default is the current working directory)
  * ''--single'' (no arguments) to evaluate single-end data
  * ''--args'' to supply additional flags, e.g. ''--args="-l 1024 -n 0.02"'' for BWA; note that the quotation marks are necessary
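
As an illustration of how the options combine, here is a hedged sketch that only assembles and prints a call. The account name, the directory names and the run limit are placeholders, and passing long options as ''--option value'' is an assumption; check ''MapperWrapper -h'' for the exact syntax.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values for this sketch; replace them with your own.
account="m2_exampleaccount"        # hypothetical SLURM account
reference="references/genome_bwa"  # hypothetical dir with an indexed reference
input="trimmed_reads"              # hypothetical dir with FASTQ files
outdir="mapped"                    # hypothetical output directory

# Collect the call in an array so the --args value stays a single word.
# --single is set because -l and -n are options of 'bwa aln'.
cmd=(MapperWrapper
     --account "$account"
     --runlimit 600
     --outdir "$outdir"
     --single
     --args="-l 1024 -n 0.02"
     "$reference" "$input")

# Print the command for inspection; %q shows how each word is quoted.
printf '%q ' "${cmd[@]}"
echo

# On Mogon, after 'module load bio/parallel_MappingTools', run it with:
#   "${cmd[@]}"
```

Building the call as an array keeps the flags passed via ''--args'' together, which is the purpose of the quotation marks mentioned in the option list.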
| |
Output: | |
| |
  * For paired-end data, one BAM file with the prefix of the input is written per input tuple. For single-end data, there is one output file per input file.
| |
| |
==== GPU-based ==== | |
| |
While it adheres to the paradigm described above, ''barracuda'' is the only supported read-mapping software that runs on GPUs((If you would like to see additional tools installed and / or supported, get in touch with us.)). Its setup is peculiar enough to merit a separate module.

To scale the task from mapping one (or a few) samples to mapping several in parallel, we provide a wrapper script, which is available as a module:
| |
''bio/parallel_Barracuda'' | |
| |
Calling ''parallel_Barracuda -h'' will display a help message with all the options the script provides. Likewise, the call ''parallel_Barracuda --credits'' will display credits and a version history.

After loading the module, the script can be run as follows:
| |
<code bash> | |
$ parallel_Barracuda [options] <referencedir> <inputdir> | |
</code> | |
| |
<WRAP center round important 90%> | |
**Limitations**: | |
* See the parallel_BWA wrapper | |
* Also: The script will only use the ''m2_gpu'' partition and therefore needs an account with the ''m2_'' prefix. | |
</WRAP> | |
| |
| |
About Arguments: | |
* ''referencedir'' needs to be the (relative) path to a directory containing an indexed BWA reference. No symbolic links are allowed. | |
* ''inputdir'' needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''unpaired'' are ignored; this is to support preprocessing with the [[software:topical:lifescience:trimmomatic|trimmomatic module]]. | |
| |
| |
The options: | |
  * The script attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied.
  * ''-d,--dependency'' comma-separated list of job IDs that need to finish before the job starts
  * ''-l,--runlimit'' run time limit; defaults to 300 minutes
  * ''-o,--outdir'' output directory path (default is the current working directory)
| |
Output: | |
| |
  * For paired-end data, one BAM file with the prefix of the input is written per input tuple. For single-end data, there is one output file per input file.
| |
===== Comparison Benchmarks ===== | |
| |
<WRAP center round todo 65%> | |
This part needs some more time to be finished .... | |
</WRAP> | |