software:topical:lifescience:ngs_read_mapping_tools

Differences

This shows you the differences between two versions of the page.

software:topical:lifescience:ngs_read_mapping_tools [2018/09/17 14:26]
meesters
software:topical:lifescience:ngs_read_mapping_tools [2020/10/02 14:52]
jrutte02 removed
Line 1: Line 1:
 ====== NGS Read Mapping Software on Mogon ====== ====== NGS Read Mapping Software on Mogon ======
  
-<WRAP center round todo 65%> +As a first introduction to NGS alignment software, we recommend reading this short [[https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software|blog post]]. Or in other words: the list of supported tools may grow and grow, [[https://hpc.uni-mainz.de/high-performance-computing/service-angebot/softwareinstallation/|due to your requests]], but it will never really cover everybody's favorite tool - there are just too many and some are just not worth having.
-This page is currently under construction +
-</WRAP>+
  
-As a first introduction into NGS alignment software tools we recommend reading this short [[https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software|blog post]]. Or in other words: It might be, that the list of supported tools grows and grows, [[https://hpc.uni-mainz.de/high-performance-computing/service-angebot/softwareinstallation/|due to your requests]], but will never really cover everybody's favorite tool. 
- 
-Notwithstanding our own [[software:topical:lifescience:ngs_read_mapping_tools#Comparison_Benchmarks|benchmarks]], a first impression can be found in [[http://www.ecseq.com/support/benchmark.html|the same blog]]. 
 ===== Software Options ===== ===== Software Options =====
  
Line 14: Line 9:
 [[http://bio-bwa.sourceforge.net/|BWA]] is one mapping tool, particularly to map "low-divergent sequences against a large reference genome". Modules on Mogon can be found as((loading a module without version specification will load the most recent one)): [[http://bio-bwa.sourceforge.net/|BWA]] is one mapping tool, particularly to map "low-divergent sequences against a large reference genome". Modules on Mogon can be found as((loading a module without version specification will load the most recent one)):
  
-''bio/BWA''+''bio/BWA/<version>''
  
-=== The Wrapper Script ===+You can find a wrapper to ease your workflow, [[software:topical:lifescience:#standard_mappers|below]].
  
-<WRAP center round alert 90%> + 
-The wrapper script is not installed yet. +==== BarraCuda ==== 
 + 
 +[[http://seqbarracuda.sourceforge.net/|Barracuda]] is a GPU-accelerated implementation of [[http://bio-bwa.sourceforge.net/|BWA]] and can be found on Mogon as the module 
 + 
 +''bio/barracuda'' 
 + 
 +It does not support ''bwa mem ...'' but rather leverages ''bwa aln ...'' to GPUs. 
 + 
 +See [[:software:topical:lifescience:ngs_read_mapping_tools#gpu-based|below for a wrapper script]] to ease your workflow. 
 + 
 +==== Minimap2 ==== 
 + 
 +[[https://github.com/lh3/minimap2|Minimap2]] is supposed to be a replacement for ''bwa mem''. Modules are installed under  
 + 
 +''bio/minimap2'' 
 + 
 + 
 +==== RazerS 3 ==== 
 + 
 +[[https://academic.oup.com/bioinformatics/article/28/20/2592/206947|RazerS 3]], like [[:software:topical:lifescience:ngs_read_mapping_tools#yara|yara]], is part of the SeqAn modules: 
 + 
 +''bio/SeqAn/<version>'' 
 + 
 +You can find a wrapper to ease your workflow, [[software:topical:lifescience:#standard_mappers|below]]. 
 + 
 + 
 +==== Bowtie2 ==== 
 + 
 +[[https://www.nature.com/articles/nmeth.1923|Bowtie2]] is a well known read aligner with a focus on gapped alignments. 
 + 
 +Module(s) can be found at: 
 + 
 +''bio/Bowtie2/<version>'' 
 + 
 +You can find a wrapper to ease your workflow, [[software:topical:lifescience:#standard_mappers|below]]. 
 + 
 +==== STAR ==== 
 + 
 +[[https://www.ncbi.nlm.nih.gov/pubmed/23104886|STAR]] is a well known mapping tool for RNA-Seq data.  
 + 
 +Module(s) can be found at: 
 + 
 +''bio/STAR/<version>'' 
 + 
 +You can find a wrapper to ease your workflow, [[software:topical:lifescience:#standard_mappers|below]]. 
 + 
 +==== segemehl ==== 
 + 
 +[[https://www.ncbi.nlm.nih.gov/pubmed/24626854|segemehl]] seems to be a pretty good alignment tool, mentioned here, due to the blog which is cited below. 
 + 
 +<WRAP center round info 90%> 
 +There will be no wrapper script for ''segemehl'': If this [[http://www.ecseq.com/support/benchmark.html|comparison]] bears any truth, the software might be really good, but it is also pretty memory hungry, and several tens of GB per core is just too much. If you want to try segemehl, be sure to write your own wrapper script (perhaps stage-in the reference to a local scratch, not the ramdisk) and reserve sufficient memory. Be aware that you will be accounted for the prolonged run time and memory.
 </WRAP> </WRAP>
  
-To leverage the task from 1 (or a few) samples to be mapped to several in parallel, we provide a wrapper script, which is available as a module: +The currently installed module is 
 + 
 +''bio/segemehl/0.2.0-foss-2018a'' 
 + 
 +==== TopHat ==== 
 + 
 +[[https://ccb.jhu.edu/software/tophat/index.shtml|TopHat]] is a fast splice junction mapper for RNA-Seq reads. 
 + 
 +Module can be found at: 
 + 
 +''bio/TopHat/<version>'' 
 + 
 + 
 +<WRAP center round info 90%> 
 +This program is not yet incorporated into the wrapping module. 
 +</WRAP> 
 +==== yara ==== 
 + 
 +[[https://academic.oup.com/nar/article/41/7/e78/1068067|yara]] is a mapping tool "with approximate seeds and multiple backtracking".  
 + 
 +It is available within the modules 
 + 
 +''bio/SeqAn/<version>'' 
 + 
 +You can find a wrapper to ease your workflow, [[software:topical:lifescience:#standard_mappers|below]]. 
 + 
 +===== Wrapper Scripts ===== 
 + 
 +==== "Standard Mappers" ==== 
 + 
 +Most mapping tools adhere to this paradigm: They work on a reference (directory). They are, therefore, easily wrapped, such that the reference can be staged-in to a node-local directory (e.g. a [[:ramdisk|ramdisk]]) in order to avoid random I/O (and consequently prolonged run times) on the parallel file system. 
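
The stage-in idea described above can be sketched in a few lines of shell. This is only an illustration under stated assumptions: ''stage_in'' is a hypothetical helper name, the paths are placeholders, and the actual wrapper may stage the reference differently.

```bash
#!/usr/bin/env bash
# Sketch: copy a reference directory to a node-local location before mapping.
# stage_in and its two arguments are hypothetical, not part of the wrapper.
stage_in() {
    local refdir=$1      # directory holding the indexed reference
    local localbase=$2   # node-local base directory, e.g. a ramdisk mount
    local target
    target="$localbase/$(basename "$refdir")"
    mkdir -p "$target" || return 1
    # -L dereferences symbolic links, so the local copy is self-contained
    cp -rL "$refdir"/. "$target"/ || return 1
    # print the staged path so the caller can hand it to the mapper
    printf '%s\n' "$target"
}

# Example call (placeholder paths):
# local_ref=$(stage_in /path/to/reference "${TMPDIR:-/tmp}")
```

The mapper is then pointed at the printed path instead of the original directory, so that index lookups hit node-local storage rather than the parallel file system.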
 + 
 + 
 +Now, to leverage the task from 1 (or a few) samples to be mapped to several in parallel, we provide a wrapper script, which is available as a module: 
  
-''bio/parallel_BWA''+''bio/parallel_MappingTools''
  
 The code is under version management and hosted [[https://gitlab.rlp.net/hpc-jgu-lifescience/seq-analysis|internally, here]]. The code is under version management and hosted [[https://gitlab.rlp.net/hpc-jgu-lifescience/seq-analysis|internally, here]].
Line 32: Line 111:
 </WRAP> </WRAP>
  
-Calling ''parallel_BWA -h'' will display a help message with all the options, the script provides. Likewise, the call ''parallel_BWA --credits'' will display credits and a version history.+Calling ''MapperWrapper -h'' will display a help message with all the options, the script provides. Likewise, the call ''MapperWrapper --credits'' will display credits and a version history.
  
 The script, after loading the module, can then be run like: The script, after loading the module, can then be run like:
  
 <code bash> <code bash>
-parallel_BWA [options] <referencedir> <inputdir>+MapperWrapper --executable=<executable> [options] <referencedir> <inputdir>
 </code> </code>
  
 <WRAP center round important 90%> <WRAP center round important 90%>
-**Limitations**:+**Considerations**:
  
-  * The wrapper recognizes FASTQ files with suffixes "''*.gz''", "''*.fastq''" or "''*.fq''" and will allways assume FASTQ files (compressed or uncompressed).+  * The wrapper recognizes FASTQ files with suffixes "''*.gz''", "''*.fastq''" or "''*.fq''" and will always assume FASTQ files (compressed or uncompressed). [[software:topical:lifescience:#yara|yara]] accepts bzipped files, too.
   * The number of processes (and therefore nodes) is limited to the number of samples.   * The number of processes (and therefore nodes) is limited to the number of samples.
   * The wrapper only works for paired end sequencing data, where the file tuples are designated with the following strings "''_1''" and "''_2''" or "''_R1''" and "''_R2''", respectively.   * The wrapper only works for paired end sequencing data, where the file tuples are designated with the following strings "''_1''" and "''_2''" or "''_R1''" and "''_R2''", respectively.
-  * BWA does not scale well to big data. It is better to split input to chuncks of ~1GB +  * There are only a few options, as internally the wrapper calls ''bwa mem'' (or ''bwa aln'' in the single end case) and only sets up a few things to yield performance. Likewise a switch for single and paired end data exists for other mappers.
-  * BWA does not scale well beyond a NUMA block (8 threads on Mogon I) +
-  * There are only a few options, as internally the wrapper calls ''bwa mem'' and only sets up a few things to yield performance.+
 </WRAP> </WRAP>
  
Line 54: Line 131:
  
   * ''referencedir'' needs to be the (relative) path to a directory containing an indexed BWA reference   * ''referencedir'' needs to be the (relative) path to a directory containing an indexed BWA reference
-  * ''inputdir'' needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''unpaired'' are ignored; this is to support preprocessing with the [[software:topical:lifescience:trimmomatic|trimmomatic module]].+  * ''inputdir'' needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''unpaired'' are ignored; this is to support preprocessing with the [[software:topical:lifescience:qc|quality check module]].
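
To check in advance which files would be treated as pairs, the naming rules above can be approximated with the following sketch. ''list_pairs'' is a hypothetical helper, and the assumption that only the top level of ''inputdir'' is searched is ours; the wrapper's own detection logic may differ in detail.

```bash
#!/usr/bin/env bash
# Sketch: list forward/reverse FASTQ pairs in an input directory.
# list_pairs is a hypothetical helper; it mimics the documented naming rules
# (_R1/_R2 or _1/_2, suffixes .gz/.fastq/.fq, "unpaired" files ignored).
list_pairs() {
    local inputdir=$1 fwd base mate rev
    find "$inputdir" -maxdepth 1 -type f \
         \( -name '*_R1*' -o -name '*_1*' \) \
         \( -name '*.gz' -o -name '*.fastq' -o -name '*.fq' \) \
         ! -name '*unpaired*' | sort |
    while IFS= read -r fwd; do
        base=${fwd##*/}                       # file name without directory
        case $base in
            *_R1*) mate="${base%%_R1*}_R2${base#*_R1}" ;;
            *)     mate="${base%%_1*}_2${base#*_1}" ;;
        esac
        rev="${fwd%/*}/$mate"
        # report the pair only if the reverse-read file really exists
        if [ -f "$rev" ]; then
            printf '%s\t%s\n' "$fwd" "$rev"
        fi
    done
}

# Example call (placeholder path):
# list_pairs /path/to/inputdir
```

Each output line holds one tab-separated pair; forward files without a matching mate are silently skipped.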
  
 The options: The options:
-  * ''parallel_BWA'' attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied. +  * ''MapperWrapper'' attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied. 
-  * ''-N,--nodes'' allows to reserve more than 1 node (the default). This may speed up the screening; see the limitations above.+  * ''--verbose,--no-verbose''  verbose execution (off by default) 
 +  * ''--executable''  mandatory argument to designate the executable; possible arguments: ''bwa'', ''bowtie2'', ''yara''
   * ''-d,--dependency'', comma-separated list of job IDs; the job will wait for these jobs to finish   * ''-d,--dependency'', comma-separated list of job IDs; the job will wait for these jobs to finish
   * ''-l,--runlimit'', this defaults to 300 minutes.   * ''-l,--runlimit'', this defaults to 300 minutes.
   * ''-p,--partition'', the default is ''nodeshort'' or ''parallel'' on Mogon2, no smp-partition should be chosen.   * ''-p,--partition'', the default is ''nodeshort'' or ''parallel'' on Mogon2, no smp-partition should be chosen.
-  * ''-t,--threads'', BWA can work in parallel. Please consult the manual. The default is 8. 
   * ''-o,--outdir'' output directory path (default is the current working directory)   * ''-o,--outdir'' output directory path (default is the current working directory)
 +  * ''--tag'' optional tag/prefix for logfiles and directories
 +  * ''--groups'' set to provide a list of read group tags (the number of tags must equal the number of files)
 +  * ''--single'' (no arguments) to evaluate single end data
 +  * ''--args'' to supply additional flags, e. g. ''--args="-l 1024 -n 0.02"'' for BWA - note the quotation marks, they are necessary.
      
-  +Output:
  
-==== BarraCuda ====+  * Per input tuple (paired sequencing data, only) a BAM file with the prefix of the input will be written. In the case of single end data, there will be one output per input, only.
  
-[[http://seqbarracuda.sourceforge.net/|Barracuda]] is a GPU-accelerated implementation of [[http://bio-bwa.sourceforge.net/|BWA]] and can be found on Mogon as the module+=== Generating Read Group Tags ===
  
-''bio/barracuda''+Read group tags can be inserted with the ''--groups'' flag((From version 0.6 onward.)). The tags are supplied as a list on the command line. An example code to generate a tag list for consecutively ordered tags would be:
  
-It does not support ''bwa mem ...'' but rather leverages ''bwa aln ...'' to GPUs.+<code bash> 
 +# defining the input directory appropriately in a master script: 
 +inputdir=/some/path/to/your/data # assuming '_R1' defines the forward reads in a paired end scenario
  
-=== The Wrapper Script ===+# a template - may deviate from project to project 
 +template="@RG\tID:+ID+\tLB:unknown_lb\tPL:illumina\tSM:sample+ID+" 
 +# the tag list to be generated 
 +tags="" 
 +# number of samples - this snippet could be integrated in a script  
 +nsamples=$(find "$inputdir" -name '*_R1*.fastq' | grep -v unpaired | wc -l) 
 +# now the actual generation (inclusive upper bound: one tag per sample): 
 +for ((i=1; i <= nsamples; i++)); do 
 +  tags="$tags $(sed -e "s/+ID+/$i/g" <<< "$template")" 
 +done 
 +</code>
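
If consecutive numbers are not meaningful for a project, the tags could instead be derived from the file names. The following variant is only a sketch: ''make_tags'' is a hypothetical helper, the tag fields follow the template above, and using the file prefix before ''_R1'' as sample name is an assumption.

```bash
#!/usr/bin/env bash
# Sketch: build one read group tag per forward-read file, using the file
# name prefix as ID and sample name. make_tags is a hypothetical helper.
make_tags() {
    local inputdir=$1 tags="" f base sample
    for f in "$inputdir"/*_R1*.fastq; do
        [ -e "$f" ] || continue             # no match: the glob stays literal
        base=${f##*/}                       # file name without directory
        case $base in
            *unpaired*) continue ;;         # skip leftovers from trimming
        esac
        sample=${base%%_R1*}                # everything before the first _R1
        # \t stays a literal backslash-t, as in the template above
        tags="$tags @RG\tID:$sample\tLB:unknown_lb\tPL:illumina\tSM:$sample"
    done
    printf '%s\n' "${tags# }"               # drop the leading blank
}

# Example call (placeholder path):
# tags=$(make_tags /some/path/to/your/data)
```

The tags are emitted in the sort order of the file names, separated by blanks, which matches the list built by the loop above.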
  
-==== Razer3 ==== 
  
-=== The Wrapper Script ===+==== GPU-based ====
  
-==== Bowtie2 ====+Whilst adhering to the same paradigm, mentioned above, ''barracuda'' is the only read-mapping software supported, which works on GPUs((If you like to see additional tools installed and / or supported, get in touch with us.)). This is different and peculiar in its setup and merits a separate module:
  
-[[https://www.nature.com/articles/nmeth.1923|Bowtie2]] is well known read aligner with focus on gapped alignments.+To leverage the task from 1 (or a few) samples to be mapped to several in parallel, we provide a wrapper script, which is available as a module: 
  
-As //preliminary// scaling tests indicate that the program can scale to a full node and is still reasonably fast, no wrapper script has been installed as a module, so far((If you feel a workflow logic can profit from a wrapper, please approach us.)). Instead, a few samples are given:+''bio/parallel_Barracuda''
  
-=== A Sample Script ===+Calling ''parallel_Barracuda -h'' will display a help message with all the options, the script provides. Likewise, the call ''parallel_Barracuda --credits'' will display credits and a version history.
  
-==== STAR ====+The script, after loading the module, can then be run like:
  
-=== The Wrapper Script ===+<code bash> 
 +$ parallel_Barracuda [options] <referencedir> <inputdir> 
 +</code>
  
-==== segemehl ====+<WRAP center round important 90%> 
 +**Considerations**: 
 +  * See the [[software:topical:lifescience:ngs_read_mapping_tools#standard_mappers|"standard" Mappers]] 
 +  * Also: The script will only use the ''m2_gpu'' partition and therefore needs an account with the ''m2_'' prefix((This is because development to support the wild "zoo" of hardware and partition setting is hardly worth the effort for this software, as tests show that standard bwa (properly mapped) outperforms the gpu version.)). 
 +</WRAP>
  
-[[https://www.ncbi.nlm.nih.gov/pubmed/24626854|segemehl]] seems to be a pretty good alignment tool, mentioned here, due to the blog which is cited below. 
  
-<WRAP center round info 90%> +About Arguments: 
-There will be no wrapper script for ''segemehl'': If this [[http://www.ecseq.com/support/benchmark.html|comparison]] bears any truth, the software might be really good. But also pretty memory hungry. And several tens GB / core is just too mutch. If you want to try segemehl, be sure to write your own wrapper script (perhaps stage-in the reference to a local scratch, not the ramdisk) and reserve sufficient memory. Be aware that you will be accounted for the pro-longed run time and memory +  * ''referencedir'' needs to be the (relative) path to a directory containing an indexed BWA reference. No symbolic links are allowed. 
-</WRAP>+  * ''inputdir'' needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''unpaired'' are ignored; this is to support preprocessing with the [[software:topical:lifescience:qc|quality check module]]. 
 + 
 + 
 +The options: 
 +  * ''parallel_BWA'' attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied. 
 +  * ''-d,--dependency'', comma-separated list of job IDs; the job will wait for these jobs to finish 
 +  * ''-l,--runlimit'', this defaults to 300 minutes. 
 +  * ''-o,--outdir'' output directory path (default is the current working directory) 
 +   
 +Output: 
 + 
 +  * Per input tuple (paired sequencing data, only) a BAM file with the prefix of the input will be written. In the case of single end data, there will be one output per input, only.
  
 ===== Comparison Benchmarks ===== ===== Comparison Benchmarks =====
 +
 +
 +Notwithstanding our own [[software:topical:lifescience:ngs_read_mapping_tools#Comparison_Benchmarks|benchmarks]], a first impression can be found in [[http://www.ecseq.com/support/benchmark.html|the same blog]].
  
 <WRAP center round todo 65%> <WRAP center round todo 65%>
 This part needs some more time to be finished .... This part needs some more time to be finished ....
 </WRAP> </WRAP>