
Differences

This shows you the differences between two versions of the page.

software:topical:lifescience:ngs_read_mapping_tools [2018/10/31 11:08]
meesters [BWA]
software:topical:lifescience:ngs_read_mapping_tools [2019/10/24 15:48] (current)
meesters [BarraCuda]
Line 1: Line 1:
 ====== NGS Read Mapping Software on Mogon ======

-<WRAP center round todo 65%>
-This page is currently under construction
-</WRAP>
-
-As a first introduction into NGS alignment software tools we recommend reading this short [[https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software|blog post]]. Or in other words: It might be, that the list of supported tools grows and grows, [[https://hpc.uni-mainz.de/high-performance-computing/service-angebot/softwareinstallation/|due to your requests]], but will never really cover everybody's favorite tool.
 +As a first introduction into NGS alignment software tools we recommend reading this short [[https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software|blog post]]. Or in other words: It might be, that the list of supported tools grows and grows, [[https://hpc.uni-mainz.de/high-performance-computing/service-angebot/softwareinstallation/|due to your requests]], but will never really cover everybody's favorite tool - there are just too many and some are just not worth having.

-Notwithstanding, own [[software:topical:lifescience:ngs_read_mapping_tools#Comparison_Benchmarks|benchmarks]] a first impression can be found in [[http://www.ecseq.com/support/benchmark.html|the same blog]].
 ===== Software Options =====
  
Line 14: Line 9:
 [[http://bio-bwa.sourceforge.net/|BWA]] is a mapping tool, particularly suited to mapping "low-divergent sequences against a large reference genome". Modules on Mogon can be found as((loading a module without version specification will load the most recent one)):

-''bio/BWA''
 +''bio/BWA/<version>''

-=== The Wrapper Script ===
 +You can find a wrapper to ease your workflow, [[software:topical:lifescience:#standard_mappers|below]].

-To leverage the task from 1 (or a few) samples to be mapped to several in parallel, we provide a wrapper script, which is available as a module:

-''bio/parallel_BWA''
 +==== BarraCuda ====
 + 
 +[[http://​seqbarracuda.sourceforge.net/​|Barracuda]] is a GPU-accelerated implementation of [[http://​bio-bwa.sourceforge.net/​|BWA]] and can be found on Mogon as the module 
 + 
 +''​bio/​barracuda''​ 
 + 
 +It does not support ''bwa mem ...'' but instead offloads ''bwa aln ...'' to GPUs.
 + 
 +See [[:​software:​topical:​lifescience:​ngs_read_mapping_tools#​gpu-based|below for a wrapper script]] to ease your workflow. 
 + 
 +==== Minimap2 ==== 
 + 
 +[[https://​github.com/​lh3/​minimap2|Minimap2]] is supposed to be a replacement for ''​bwa mem''​. Modules are installed under  
 + 
 +''​bio/​minimap2''​ 
 + 
 + 
 +==== RazerS 3 ==== 
 + 
 +[[https://academic.oup.com/bioinformatics/article/28/20/2592/206947|RazerS 3]], like [[:software:topical:lifescience:ngs_read_mapping_tools#yara|yara]], is part of the SeqAn modules:
 + 
 +''​bio/​SeqAn/<​version>''​ 
 + 
 +You can find a wrapper to ease your workflow, [[software:​topical:​lifescience:#​standard_mappers|below]]. 
 + 
 + 
 +==== Bowtie2 ==== 
 + 
 +[[https://​www.nature.com/​articles/​nmeth.1923|Bowtie2]] is a well known read aligner with a focus on gapped alignments. 
 + 
 +Module(s) can be found at: 
 + 
 +''​bio/​Bowtie2/<​version>''​ 
 + 
 +You can find a wrapper to ease your workflow, [[software:​topical:​lifescience:#​standard_mappers|below]]. 
 + 
 +==== STAR ==== 
 + 
 +[[https://​www.ncbi.nlm.nih.gov/​pubmed/​23104886|STAR]] is a well known mapping tool for RNA-Seq data.  
 + 
 +Module(s) can be found at: 
 + 
 +''​bio/​STAR/<​version>''​ 
 + 
 +You can find a wrapper to ease your workflow, [[software:​topical:​lifescience:#​standard_mappers|below]]. 
 + 
 +==== segemehl ==== 
 + 
 +[[https://www.ncbi.nlm.nih.gov/pubmed/24626854|segemehl]] appears to be a good alignment tool; it is mentioned here because of the blog cited below.
 + 
 +<WRAP center round info 90%> 
 +There will be no wrapper script for ''segemehl'': If this [[http://www.ecseq.com/support/benchmark.html|comparison]] bears any truth, the software might be really good, but it is also pretty memory hungry, and several tens of GB per core is just too much. If you want to try segemehl, be sure to write your own wrapper script (perhaps stage-in the reference to a local scratch, not the ramdisk) and reserve sufficient memory. Be aware that both the prolonged run time and the reserved memory will be charged to your account.
 +</​WRAP>​ 
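Since no wrapper is provided for segemehl, the stage-in step mentioned above could look roughly like the following sketch. The helper name ''stage_in'' and both paths are hypothetical; where the node-local scratch directory lives on Mogon is not stated here and has to be filled in.

```shell
#!/usr/bin/env bash
# Sketch only: stage_in is a hypothetical helper, not part of any installed module.
# stage_in SRC DEST: copy the reference directory SRC into DEST and
# print the path of the staged copy.
stage_in() {
  local src=${1%/} dest=$2      # strip a trailing slash from SRC
  mkdir -p "$dest" || return 1
  cp -r "$src" "$dest"/ || return 1
  printf '%s\n' "$dest/${src##*/}"
}

# Example call (both paths are placeholders):
# staged_ref=$(stage_in /path/to/reference /path/to/local/scratch)
```

The printed path can then be handed to the mapper in place of the original reference directory, so that the random I/O hits the local copy.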
 + 
 +The currently installed module is 
 + 
 +''​bio/​segemehl/​0.2.0-foss-2018a''​ 
 + 
 +==== TopHat ==== 
 + 
 +[[https://​ccb.jhu.edu/​software/​tophat/​index.shtml|TopHat]] is a fast splice junction mapper for RNA-Seq reads. 
 + 
 +Module can be found at: 
 + 
 +''​bio/​TopHat/<​version>''​ 
 + 
 + 
 +<WRAP center round info 90%> 
 +This program is not yet incorporated into the wrapping module. 
 +</​WRAP>​ 
 +==== yara ==== 
 + 
 +[[https://academic.oup.com/nar/article/41/7/e78/1068067|yara]] is a mapping tool "with approximate seeds and multiple backtracking".
 + 
 +It is available within the modules 
 + 
 +''​bio/​SeqAn/<​version>''​ 
 + 
 +You can find a wrapper to ease your workflow, [[software:​topical:​lifescience:#​standard_mappers|below]]. 
 + 
 +===== Wrapper Scripts ===== 
 + 
 +==== "​Standard Mappers"​ ==== 
 + 
 +Most mapping tools adhere to this paradigm: They work on a reference (directory). They are, therefore, easily wrapped, such that the reference can be staged-in to a node-local directory (e.g. a [[:​ramdisk|ramdisk]]) in order to avoid random I/O (and consequently prolonged run times) on the parallel file system. 
 + 
 + 
 +Now, to scale the task from 1 (or a few) samples to several mapped in parallel, we provide a wrapper script, which is available as a module:
 + 
 +''​bio/​parallel_MappingTools''​
  
 The code is under version management and hosted [[https://gitlab.rlp.net/hpc-jgu-lifescience/seq-analysis|internally, here]].
Line 28: Line 111:
 </WRAP>

-Calling ''parallel_BWA -h'' will display a help message with all the options, the script provides. Likewise, the call ''parallel_BWA --credits'' will display credits and a version history.
 +Calling ''MapperWrapper -h'' will display a help message with all the options, the script provides. Likewise, the call ''MapperWrapper --credits'' will display credits and a version history.

 The script, after loading the module, can then be run like:

 <code bash>
-parallel_BWA [options] <referencedir> <inputdir>
 +MapperWrapper --executable=<executable> [options] <referencedir> <inputdir>
 </code>
  
 <WRAP center round important 90%>
-**Limitations**:
 +**Considerations**:

-  * The wrapper recognizes FASTQ files with suffixes "''*.gz''", "''*.fastq''" or "''*.fq''" and will allways assume FASTQ files (compressed or uncompressed).
 +  * The wrapper recognizes FASTQ files with suffixes "''*.gz''", "''*.fastq''" or "''*.fq''" and will always assume FASTQ files (compressed or uncompressed). [[software:topical:lifescience:#yara|yara]] accepts bzipped files, too.
   * The number of processes (and therefore nodes) is limited to the number of samples.
   * The wrapper only works for paired end sequencing data, where the file tuples are designated with the following strings "''_1''" and "''_2''" or "''_R1''" and "''_R2''", respectively.
-  * BWA does not scale well to big data. It is better to split input to chuncks of ~1GB
-  * BWA does not scale well beyond a NUMA block (8 threads on Mogon I)
-  * There are only a few options, as internally the wrapper calls ''bwa mem'' and only sets up a few things to yield performance.
 +  * There are only a few options, as internally the wrapper calls ''bwa mem'' (or ''bwa aln'' in the single end case) and only sets up a few things to yield performance. Likewise a switch for single and paired end data exists for other mappers.
 </WRAP>
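Before submitting, it can help to check that the input directory follows the naming conventions listed above. The following is a minimal sketch of such a check; ''check_pairs'' is a hypothetical helper and only approximates the rules as described here, it is not the wrapper's own logic.

```shell
#!/usr/bin/env bash
# Sketch only: check_pairs is a hypothetical helper, not part of the wrapper.
# check_pairs DIR: print every forward-read file in DIR that has no mate.
check_pairs() {
  local dir=$1 f base mate
  for f in "$dir"/*.fastq "$dir"/*.fq "$dir"/*.gz; do
    [ -e "$f" ] || continue              # glob pattern matched nothing
    base=${f##*/}
    case $base in
      *unpaired*) continue ;;            # such files are ignored, see the text
      *_R1*) mate=${base/_R1/_R2} ;;     # _R1 / _R2 convention
      *_1.*) mate=${base/_1./_2.} ;;     # _1 / _2 convention
      *) continue ;;                     # reverse reads and unrelated files
    esac
    [ -e "$dir/$mate" ] || echo "missing mate for $base"
  done
}

# Example call (placeholder path):
# check_pairs /path/to/inputdir
```

No output means every forward-read file found has a partner with the expected name.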
  
Line 50: Line 131:
  
   * ''referencedir'' needs to be the (relative) path to a directory containing an indexed BWA reference
-  * ''inputdir'' needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''unpaired'' are ignored; this is to support preprocessing with the [[software:topical:lifescience:trimmomatic|trimmomatic module]].
 +  * ''inputdir'' needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''unpaired'' are ignored; this is to support preprocessing with the [[software:topical:lifescience:qc|quality check module]].

 The options:
-  * ''parallel_BWA'' attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied.
-  * ''-N,--nodes'' allows to reserve more than 1 node (the default). This may speed up the screening; see the limitations above.
 +  * ''MapperWrapper'' attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied.
 +  * ''--verbose,--no-verbose'' verbose execution (off by default)
 +  * ''--executable'' mandatory argument to designate the executable; possible arguments: ''bwa'', ''bowtie2'', ''yara''
   * ''-d,--dependency'', comma-separated list of job IDs that the job will wait for before it starts
   * ''-l,--runlimit'', this defaults to 300 minutes.
   * ''-p,--partition'', the default is ''nodeshort'' or ''parallel'' on Mogon2, no smp-partition should be chosen.
-  * ''-t,--threads'', BWA can work in parallel. Please consult the manual. The default is 8.
   * ''-o,--outdir'' output directory path (default is the current working directory)
 +  * ''--tag'' optional tag/prefix for logfiles and directories
 +  * ''--groups'' to provide a list of read group tags (the number of tags must equal the number of files)
 +  * ''--single'' (no arguments) to evaluate single end data
 +  * ''--args'' to supply additional flags, e. g. ''--args="-l 1024 -n 0.02"'' for BWA - note the quotation marks, they are necessary.
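The remark about quotation marks for ''--args'' is easiest to honour by assembling the call in a bash array. The sketch below only builds and prints the command; both paths are placeholders, and only option spellings shown on this page are used.

```shell
#!/usr/bin/env bash
# Sketch only: both paths are placeholders.
referencedir="reference/bwa_index"
inputdir="data/fastq"

# Each array element is passed as exactly one argument, so the
# blank-separated BWA flags inside --args stay together.
cmd=(MapperWrapper --executable=bwa
     --args="-l 1024 -n 0.02"
     "$referencedir" "$inputdir")

# Print the call for inspection instead of submitting it.
printf '%q ' "${cmd[@]}"; echo

# To run it for real (requires the module on the cluster):
# module load bio/parallel_MappingTools && "${cmd[@]}"
```

Printing with ''%q'' shows how the shell would have to quote each argument, which makes a missing pair of quotation marks easy to spot.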
   ​   ​
 Output:

-  * Per input tuple (paired sequencing data, only) a sorted BAM file with the prefix of the input will be written.
 +  * Per input tuple (paired sequencing data, only) a BAM file with the prefix of the input will be written. In the case of single end data, there will be one output per input, only.
  
-<WRAP center round info 90%>
-Currently the wrapper supports a start like:
-''bwa mem ... | samtools view -Shb -o ...''
-with flags controlling parallelism. Additional flags would require to add more boilerplate code to the wrapper. See below for note on improving wrapper scripts.
-</WRAP>
 +=== Generating Read Group Tags ===

-==== BarraCuda ====
 +Read group tags can be inserted with the ''--groups'' flag((From version 0.6 onward.)). The tags are supplied as a list on the command line. An example code to generate a tag list for consecutively ordered tags would be:
  
-[[http://seqbarracuda.sourceforge.net/|Barracuda]] is a GPU-accelerated implementation of [[http://bio-bwa.sourceforge.net/|BWA]] and can be found on Mogon as the module
 +<code bash>
 +# defining the input directory appropriately in a master script:
 +inputdir=/some/path/to/your/data # assuming '_R1' defines the forward reads in a paired end scenario

-''bio/barracuda''
 +# a template - may deviate from project to project
 +template="@RG\tID:+ID+\tLB:unknown_lb\tPL:illumina\tSM:sample+ID+"
 +# the tag list to be generated
 +tags=""
 +# number of samples - this snippet could be integrated in a script
 +nsamples=$(find "$inputdir" -name '*_R1*.fastq' | grep -v unpaired | wc -l)
 +# now the actual generation:
 +for ((i=1; i <= $nsamples; i++)); do
 +  tags="$tags $(sed -e "s/+ID+/$i/g" <<< "$template")"
 +done
 +</code>
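As a variation on the snippet above, the same template can be filled in with bash parameter expansion alone, emitting one tag per line so that the number of tags is easy to compare with the number of input files. This is only a sketch; ''make_tags'' is a hypothetical name, and how the resulting list is best handed to ''--groups'' should be checked against ''MapperWrapper -h''.

```shell
#!/usr/bin/env bash
# Sketch only: make_tags is a hypothetical helper.
# make_tags N: print N read group tags, one per line, numbered 1..N.
make_tags() {
  local n=$1 i
  # same template as above; the \t sequences stay literal
  local template='@RG\tID:+ID+\tLB:unknown_lb\tPL:illumina\tSM:sample+ID+'
  for ((i = 1; i <= n; i++)); do
    printf '%s\n' "${template//+ID+/$i}"   # replace every +ID+ with the index
  done
}

# Example: three samples
make_tags 3
```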
  
-It does not support ''​bwa mem ...''​ but rather leverages ''​bwa aln ...''​ to GPUs. 
  
-=== The Wrapper Script ===
 +==== GPU-based ====

-==== Razer3 ====
 +While adhering to the same paradigm mentioned above, ''barracuda'' is the only supported read-mapping software that works on GPUs((If you would like to see additional tools installed and / or supported, get in touch with us.)). Its setup is different and peculiar and merits a separate module:

-=== The Wrapper Script ===
 +To scale the task from 1 (or a few) samples to several mapped in parallel, we provide a wrapper script, which is available as a module:

-==== Bowtie2 ====
 +''bio/parallel_Barracuda''

-[[https://www.nature.com/articles/nmeth.1923|Bowtie2]] is well known read aligner with a focus on gapped alignments.
 +Calling ''parallel_Barracuda -h'' will display a help message with all the options, the script provides. Likewise, the call ''parallel_Barracuda --credits'' will display credits and a version history.

-As //preliminary// scaling tests indicate that the program can scale to a full node and is still reasonably fast, no wrapper script has been installed as a module, so far((If you feel a workflow logic can profit from a wrapper, please approach us.)). Instead, a few samples are given:
 +The script, after loading the module, can then be run like:

-=== A Sample Script ===
 +<code bash>
 +$ parallel_Barracuda [options] <referencedir> <inputdir>
 +</code>
  
-==== STAR ====
 +<WRAP center round important 90%>
 +**Considerations**:
 +  * See the [[software:topical:lifescience:ngs_read_mapping_tools#standard_mappers|"standard" Mappers]]
 +  * Also: The script will only use the ''m2_gpu'' partition and therefore needs an account with the ''m2_'' prefix((This is because development to support the wild "zoo" of hardware and partition settings is hardly worth the effort for this software, as tests show that standard bwa (properly mapped) outperforms the GPU version.)).
 +</WRAP>
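Because the GPU wrapper needs an account with the ''m2_'' prefix, a master script could test the account name before calling it. The following is a minimal sketch; ''has_m2_prefix'' and the account name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: has_m2_prefix is a hypothetical helper.
# Succeeds if the given SLURM account name starts with "m2_".
has_m2_prefix() {
  case $1 in
    m2_*) return 0 ;;
    *)    return 1 ;;
  esac
}

# Example: warn before calling the wrapper with an unsuitable account.
account="m2_example"   # placeholder account name
has_m2_prefix "$account" || echo "account '$account' lacks the m2_ prefix" >&2
```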
  
-=== The Wrapper Script === 
  
-==== segemehl ====
 +About Arguments:
 +  * ''​referencedir''​ needs to be the (relative) path to a directory containing an indexed BWA reference. No symbolic links are allowed. 
 +  * ''​inputdir''​ needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''​unpaired''​ are ignored; this is to support preprocessing with the [[software:​topical:​lifescience:​qc|quality check module]].
  
-[[https://​www.ncbi.nlm.nih.gov/​pubmed/​24626854|segemehl]] seems to be a pretty good alignment tool, mentioned here, due to the blog which is cited below. 
  
-<WRAP center round info 90%>
-There will be no wrapper script for ''segemehl'': If this [[http://www.ecseq.com/support/benchmark.html|comparison]] bears any truth, the software might be really good. But also pretty memory hungry. And several tens GB / core is just too mutch. If you want to try segemehl, be sure to write your own wrapper script (perhaps stage-in the reference to a local scratch, not the ramdisk) and reserve sufficient memory. Be aware that you will be accounted for the pro-longed run time and memory.
-</WRAP>
 +The options:
 +  * ''parallel_BWA'' attempts to deduce your SLURM account. This may fail, in which case ''-A, --account'' needs to be supplied.
 +  * ''-d,--dependency'', list of comma separated jobids, the job will wait for to finish
 +  * ''​-l,--runlimit'',​ this defaults to 300 minutes. 
 +  * ''​-o,​--outdir''​ output directory path (default is the current working directory) 
 +   
 +Output: 
 + 
 +  * Per input tuple (paired sequencing data, only) a BAM file with the prefix of the input will be written. In the case of single end data, there will be one output per input, only.
  
 ===== Comparison Benchmarks =====
 +
 +
 +Notwithstanding our own [[software:topical:lifescience:ngs_read_mapping_tools#Comparison_Benchmarks|benchmarks]], a first impression can be found in [[http://www.ecseq.com/support/benchmark.html|the same blog]].

 <WRAP center round todo 65%>
 This part needs some more time to be finished ....
 </WRAP>
software/topical/lifescience/ngs_read_mapping_tools.1540980502.txt.gz · Last modified: 2018/10/31 11:08 by meesters