
software:topical:lifescience:ngs_read_mapping_tools [2018/12/13 11:36]
meesters [NGS Read Mapping Software on Mogon]
— (current)
Line 1: Line 1:
-====== NGS Read Mapping Software on Mogon ====== 
  
-As a first introduction to NGS alignment software, we recommend reading this short [[https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software|blog post]]. In other words: the list of supported tools may keep growing [[https://hpc.uni-mainz.de/high-performance-computing/service-angebot/softwareinstallation/|in response to your requests]], but it will never cover everybody's favorite tool.
- 
-Apart from our own [[software:topical:lifescience:ngs_read_mapping_tools#Comparison_Benchmarks|benchmarks]], a first impression can be found in [[http://www.ecseq.com/support/benchmark.html|the same blog]].
-===== Software Options ===== 
- 
-==== BWA ==== 
- 
-[[http://bio-bwa.sourceforge.net/|BWA]] is a mapping tool designed to map "low-divergent sequences against a large reference genome". On Mogon it is available as the module((Loading a module without specifying a version loads the most recent one.)):
- 
-''bio/BWA/<version>'' 
- 
-=== The Wrapper Script === 
- 
-To scale the task from mapping one (or a few) samples to mapping many samples in parallel, we provide a wrapper script, which is available as a module:
- 
-''bio/parallel_BWA'' 
- 
-The code is under version control and hosted [[https://gitlab.rlp.net/hpc-jgu-lifescience/seq-analysis|internally, here]].
- 
-<WRAP center round important 90%> 
-The wrapper script submits a job itself: it is not intended to be run inside a SLURM job, but rather creates one.
-</WRAP> 
- 
-Calling ''parallel_BWA -h'' will display a help message with all the options the script provides. Likewise, ''parallel_BWA --credits'' will display credits and a version history.
- 
-After loading the module, the script can be run like this:
- 
-<code bash> 
-$ parallel_BWA [options] <referencedir> <inputdir> 
-</code> 
- 
-<WRAP center round important 90%> 
-**Limitations**: 
- 
-  * The wrapper recognizes input files with the suffixes "''*.gz''", "''*.fastq''" or "''*.fq''" and will always assume that they are FASTQ files (compressed or uncompressed).
-  * The number of processes (and therefore nodes) is limited to the number of samples.
-  * For paired-end sequencing data, the wrapper only works if the files of each pair are designated with the strings "''_1''" and "''_2''" or "''_R1''" and "''_R2''", respectively.
-  * BWA does not scale well to big data. It is better to split the input into chunks of ~1GB (take this with a grain of salt: there are no scaling tests yet).
-  * BWA does not scale well beyond a NUMA block (8 threads on Mogon I).
-  * There are only a few options, as the wrapper internally calls ''bwa mem'' (or ''bwa aln'' in the single-end case) and only sets up a few things to improve performance.
-</WRAP> 
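
The naming convention for paired-end input can be checked before a job is submitted. The following is a minimal sketch and not part of the wrapper: the function name ''check_pairs'' is hypothetical, and the matching rules are an assumption based only on the suffixes and pair designations listed above.

```shell
# Hypothetical helper (not part of parallel_BWA): list forward-read files in a
# directory and report whether the matching reverse-read file exists.
check_pairs() (
    dir=$1
    missing=0
    for fwd in "$dir"/*_1.* "$dir"/*_R1.*; do
        [ -e "$fwd" ] || continue              # glob did not match anything
        base=$(basename "$fwd")
        case "$base" in
            *unpaired*) continue ;;            # such files are ignored anyway
            *.gz|*.fastq|*.fq) ;;              # suffixes named in the limitations
            *) continue ;;
        esac
        # Derive the expected mate name by swapping the pair designation.
        case "$base" in
            *_R1.*) rev=$(printf '%s\n' "$base" | sed 's/_R1\./_R2./') ;;
            *)      rev=$(printf '%s\n' "$base" | sed 's/_1\./_2./') ;;
        esac
        if [ -e "$dir/$rev" ]; then
            echo "pair: $base $rev"
        else
            echo "missing mate for: $base"
            missing=1
        fi
    done
    # Exit status 0 only if every forward file has a mate.
    [ "$missing" -eq 0 ]
)
```

Calling ''check_pairs <inputdir>'' prints one line per forward-read file and returns a non-zero status if any mate is missing.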
- 
-About Arguments: 
- 
-  * ''referencedir'' needs to be the (relative) path to a directory containing an indexed BWA reference 
-  * ''inputdir'' needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''unpaired'' are ignored; this is to support preprocessing with the [[software:topical:lifescience:trimmomatic|trimmomatic module]]. 
- 
-The options: 
-  * ''parallel_BWA'' attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied.
-  * ''-N,--nodes'' allows reserving more than 1 node (1 is the default). This may speed up the mapping; see the limitations above.
-  * ''-d,--dependency'', a comma-separated list of job IDs that the job will wait for before it starts
-  * ''-l,--runlimit'', the run time limit; this defaults to 300 minutes.
-  * ''-p,--partition'', the default is ''nodeshort'', or ''parallel'' on Mogon2; no smp partition should be chosen.
-  * ''-t,--threads'', BWA can work in parallel. Please consult the manual. The default is 8.
-  * ''-o,--outdir'', the output directory path (default is the current working directory)
-  * ''--single'' (no arguments) to evaluate single-end data
-  * ''--args'' to supply additional flags, e.g. ''--args="-l 1024 -n 0.02"'' for BWA. Note the quotation marks; they are necessary.
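
As an illustration of how these options combine, the sketch below assembles a call and prints it for inspection. All values are placeholders chosen for this example (the account name and the directory names are assumptions, not defaults of the wrapper), and it is assumed that ''-l'' takes the run limit in minutes, as the default above suggests.

```shell
# Placeholder values (assumptions, not wrapper defaults): adjust to your project.
account="myaccount"     # SLURM account, only needed if auto-detection fails
reference="reference"   # directory containing an indexed BWA reference
input="fastq"           # directory containing the FASTQ files
outdir="mapped"         # output directory

# Two nodes, 8 threads per BWA process, 300 minutes run limit.
cmd="parallel_BWA -A $account -N 2 -t 8 -l 300 -o $outdir $reference $input"

# Only print the command here; on Mogon it would be run after
# "module load bio/parallel_BWA".
echo "$cmd"
```

Printing the command first makes it easy to check the arguments before the wrapper actually submits a job.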
-   
-Output: 
- 
-  * For paired-end data, one BAM file is written per input pair, named with the prefix of the input. For single-end data, one BAM file is written per input file.
- 
-==== BarraCuda ==== 
- 
-[[http://seqbarracuda.sourceforge.net/|Barracuda]] is a GPU-accelerated implementation of [[http://bio-bwa.sourceforge.net/|BWA]] and can be found on Mogon as the module 
- 
-''bio/barracuda'' 
- 
-It does not support ''bwa mem ...''; instead, it ports ''bwa aln ...'' to GPUs.
- 
-See [[:software:topical:lifescience:ngs_read_mapping_tools#gpu-based|below for a wrapper script]] to ease your workflow. 
- 
- 
- 
-==== RazerS 3 ==== 
- 
-[[https://academic.oup.com/bioinformatics/article/28/20/2592/206947|RazerS 3]], like [[:software:topical:lifescience:ngs_read_mapping_tools#yara|yara]], is part of the SeqAn modules:
- 
-''bio/SeqAn/<version>'' 
- 
-A wrapper to ease your workflow will eventually be available [[software:topical:lifescience:#standard_mappers|below]]((not yet)).
- 
- 
-==== Bowtie2 ==== 
- 
-[[https://www.nature.com/articles/nmeth.1923|Bowtie2]] is a well-known read aligner with a focus on gapped alignments.
- 
-==== STAR ==== 
- 
-=== The Wrapper Script === 
- 
-==== segemehl ==== 
- 
-[[https://www.ncbi.nlm.nih.gov/pubmed/24626854|segemehl]] appears to be a good alignment tool; it is mentioned here because of the blog comparison cited below.
- 
-<WRAP center round info 90%> 
-There will be no wrapper script for ''segemehl'': if this [[http://www.ecseq.com/support/benchmark.html|comparison]] bears any truth, the software might be really good, but also pretty memory-hungry, and several tens of GB per core is just too much. If you want to try segemehl, be sure to write your own wrapper script (perhaps staging in the reference to a local scratch directory, not the ramdisk) and reserve sufficient memory. Be aware that the prolonged run time and memory will be charged to your account.
-</WRAP> 
- 
-==== yara ==== 
- 
-[[https://academic.oup.com/nar/article/41/7/e78/1068067|yara]] is a mapping tool "with approximate seeds and multiple backtracking".
- 
-It is available within the modules 
- 
-''bio/SeqAn/<version>'' 
- 
-You can find a wrapper to ease your workflow, [[software:topical:lifescience:#standard_mappers|below]]. 
- 
-===== Wrapper Scripts ===== 
- 
-==== "Standard Mappers" ==== 
- 
-Most mapping tools adhere to this paradigm: They work on a reference (directory). They are, therefore, easily wrapped, such that the reference can be staged-in to a node-local directory (e.g. a [[:ramdisk|ramdisk]]) in order to avoid random I/O (and consequently prolonged run times) on the parallel file system. 
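
The stage-in step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of any wrapper on Mogon: the function name ''stage_in'' and the variable ''STAGE_ROOT'' are hypothetical, and the node-local target falls back to ''$TMPDIR'' and then ''/tmp''.

```shell
# Hypothetical helper: copy a reference directory to node-local storage and
# print the path of the copy. STAGE_ROOT is an assumed variable name.
stage_in() (
    src=$1
    root=${STAGE_ROOT:-${TMPDIR:-/tmp}}
    # Create a unique directory below the node-local root.
    dest=$(mktemp -d "$root/reference.XXXXXX") || exit 1
    # -R copies recursively, -L follows symbolic links so the copy holds real files.
    cp -RL "$src"/. "$dest"/ || exit 1
    echo "$dest"
)

# Usage inside a job script (commented out, paths are placeholders):
# local_ref=$(stage_in /path/to/reference) || exit 1
# trap 'rm -rf "$local_ref"' EXIT     # remove the copy when the script exits
# <mapper> ... "$local_ref" ...
```

Copying the reference once per node and mapping against the local copy keeps the random reads of the index off the parallel file system.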
- 
- 
-==== GPU-based ==== 
- 
-While adhering to the paradigm described above, ''barracuda'' is the only supported read-mapping software that runs on GPUs((If you would like to see additional tools installed and/or supported, get in touch with us.)). Its setup is different enough to merit a separate module:
- 
-To scale the task from mapping one (or a few) samples to mapping many samples in parallel, we provide a wrapper script, which is available as a module:
- 
-''bio/parallel_Barracuda'' 
- 
-Calling ''parallel_Barracuda -h'' will display a help message with all the options the script provides. Likewise, ''parallel_Barracuda --credits'' will display credits and a version history.
- 
-After loading the module, the script can be run like this:
- 
-<code bash> 
-$ parallel_Barracuda [options] <referencedir> <inputdir> 
-</code> 
- 
-<WRAP center round important 90%> 
-**Limitations**: 
-  * See the parallel_BWA wrapper 
-  * Also: The script will only use the ''m2_gpu'' partition and therefore needs an account with the ''m2_'' prefix. 
-</WRAP> 
- 
- 
-About Arguments: 
-  * ''referencedir'' needs to be the (relative) path to a directory containing an indexed BWA reference. No symbolic links are allowed. 
-  * ''inputdir'' needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''unpaired'' are ignored; this is to support preprocessing with the [[software:topical:lifescience:trimmomatic|trimmomatic module]]. 
- 
- 
-The options: 
-  * ''parallel_Barracuda'' attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied.
-  * ''-d,--dependency'', a comma-separated list of job IDs that the job will wait for before it starts
-  * ''-l,--runlimit'', this defaults to 300 minutes. 
-  * ''-o,--outdir'' output directory path (default is the current working directory) 
-   
-Output: 
- 
-  * For paired-end data, one BAM file is written per input pair, named with the prefix of the input. For single-end data, one BAM file is written per input file.
- 
-===== Comparison Benchmarks ===== 
- 
-<WRAP center round todo 65%> 
-This section is not finished yet.
-</WRAP> 
  • Last modified: 2018/12/13 11:36
  • by meesters