Differences

This shows you the differences between two versions of the page.

software:topical:lifescience:ngs_read_mapping_tools [2018/12/13 11:35]
meesters [RazerS 3]
software:topical:lifescience:ngs_read_mapping_tools [2019/10/24 15:48] (current)
meesters [BarraCuda]
Line 1: Line 1:
 ====== NGS Read Mapping Software on Mogon ====== ====== NGS Read Mapping Software on Mogon ======
  
-<WRAP center round todo 65%> +As a first introduction to NGS alignment software, we recommend reading this short [[https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software|blog post]]. In other words: the list of supported tools may keep growing [[https://hpc.uni-mainz.de/high-performance-computing/service-angebot/softwareinstallation/|due to your requests]], but it will never cover everybody's favorite tool; there are simply too many, and some are not worth having.
-This page is currently under construction +
-</WRAP>+
  
-As a first introduction into NGS alignment software tools we recommend reading this short [[https://​www.ecseq.com/​support/​ngs/​what-is-the-best-ngs-alignment-software|blog post]]. Or in other words: It might be, that the list of supported tools grows and grows, [[https://​hpc.uni-mainz.de/​high-performance-computing/​service-angebot/​softwareinstallation/​|due to your requests]], but will never really cover everybody'​s favorite tool. 
- 
-Notwithstanding,​ own [[software:​topical:​lifescience:​ngs_read_mapping_tools#​Comparison_Benchmarks|benchmarks]] a first impression can be found in [[http://​www.ecseq.com/​support/​benchmark.html|the same blog]]. 
 ===== Software Options ===== ===== Software Options =====
  
Line 16: Line 11:
 ''​bio/​BWA/<​version>''​ ''​bio/​BWA/<​version>''​
  
-=== The Wrapper Script ===+You can find a wrapper to ease your workflow, [[software:​topical:​lifescience:#​standard_mappers|below]].
  
-To leverage the task from 1 (or a few) samples to be mapped to several in parallel, we provide a wrapper script, which is available as a module: ​ 
- 
-''​bio/​parallel_BWA''​ 
- 
-The code is under version management and hosted [[https://​gitlab.rlp.net/​hpc-jgu-lifescience/​seq-analysis|internally,​ here]]. 
- 
-<WRAP center round important 90%> 
-The wrapper script will submit a job, it is not intended to be just within a SLURM environment,​ but rather creates one. 
-</​WRAP>​ 
- 
-Calling ''​parallel_BWA -h''​ will display a help message with all the options, the script provides. Likewise, the call ''​parallel_BWA --credits''​ will display credits and a version history. 
- 
-The script, after loading the module, can then be run like: 
- 
-<code bash> 
-$ parallel_BWA [options] <​referencedir>​ <​inputdir>​ 
-</​code>​ 
- 
-<WRAP center round important 90%> 
-**Limitations**:​ 
- 
-  * The wrapper recognizes FASTQ files with suffixes "''​*.gz''",​ "''​*.fastq''"​ or "''​*.fq''"​ and will allways assume FASTQ files (compressed or uncompressed). 
-  * The number of processes (and therefore nodes) is limited to the number of samples. 
-  * The wrapper only works for paired end sequencing data, where the file tuples are designated with the following strings "''​_1''"​ and "''​_2''"​ or "''​_R1''"​ and "''​_R2''",​ respectively. 
-  * BWA does not scale well to big data. It is better to split input to chuncks of ~1GB (take this with a grain of salt: there are not scaling tests, yet) 
-  * BWA does not scale well beyond a NUMA block (8 threads on Mogon I) 
-  * There are only a few options, as internally the wrapper calls ''​bwa mem''​ (or ''​bwa aln''​ in the single end case) and only sets up a few things to yield performance. 
-</​WRAP>​ 
- 
-About Arguments: 
- 
-  * ''​referencedir''​ needs to be the (relative) path to a directory containing an indexed BWA reference 
-  * ''​inputdir''​ needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''​unpaired''​ are ignored; this is to support preprocessing with the [[software:​topical:​lifescience:​trimmomatic|trimmomatic module]]. 
- 
-The options: 
-  * ''​parallel_BWA''​ attempts to deduce your SLURM account. This may fail, in which case ''​-A,​ --account''​ needs to be supplied. 
-  * ''​-N,​--nodes''​ allows to reserve more than 1 node (the default). This may speed up the screening; see the limitations above. 
-  * ''​-d,​--dependency'',​ list of comma separated jobids, the job will wait for to finish 
-  * ''​-l,​--runlimit'',​ this defaults to 300 minutes. 
-  * ''​-p,​--partition'',​ the default is ''​nodeshort''​ or ''​parallel''​ on Mogon2, no smp-partition should be choosen. 
-  * ''​-t,​--threads'',​ BWA can work in parallel. Please consult the manual. The default is 8. 
-  * ''​-o,​--outdir''​ output directory path (default is the current working directory) 
-  * ''​--single''​ (no arguments) to evaluate single end data 
-  * ''​--args''​ to supply additional flags, e. g. ''​--args="​-l 1024 -n 0.02"''​ for BWA - note the quotation marks, they are necessary. 
-  ​ 
-Output: 
- 
-  * Per input tuple (paired sequencing data, only) a BAM file with the prefix of the input will be written. In the case of single end data, there will be one output per input, only. 
  
 ==== BarraCuda ==== ==== BarraCuda ====
Line 77: Line 24:
 See [[:​software:​topical:​lifescience:​ngs_read_mapping_tools#​gpu-based|below for a wrapper script]] to ease your workflow. See [[:​software:​topical:​lifescience:​ngs_read_mapping_tools#​gpu-based|below for a wrapper script]] to ease your workflow.
  
 +==== Minimap2 ====
 +
 +[[https://​github.com/​lh3/​minimap2|Minimap2]] is supposed to be a replacement for ''​bwa mem''​. Modules are installed under 
 +
 +''​bio/​minimap2''​
  
  
Line 82: Line 34:
  
 [[https://academic.oup.com/bioinformatics/article/28/20/2592/206947|RazerS 3]], like [[:software:topical:lifescience:ngs_read_mapping_tools#yara|yara]], is part of the SeqAn modules: [[https://academic.oup.com/bioinformatics/article/28/20/2592/206947|RazerS 3]], like [[:software:topical:lifescience:ngs_read_mapping_tools#yara|yara]], is part of the SeqAn modules:
 +
 +''​bio/​SeqAn/<​version>''​
 +
 +You can find a wrapper to ease your workflow, [[software:​topical:​lifescience:#​standard_mappers|below]].
  
  
Line 87: Line 43:
  
 [[https://www.nature.com/articles/nmeth.1923|Bowtie2]] is a well-known read aligner with a focus on gapped alignments. [[https://www.nature.com/articles/nmeth.1923|Bowtie2]] is a well-known read aligner with a focus on gapped alignments.
 +
 +Module(s) can be found at:
 +
 +''​bio/​Bowtie2/<​version>''​
 +
 +You can find a wrapper to ease your workflow, [[software:​topical:​lifescience:#​standard_mappers|below]].
  
 ==== STAR ==== ==== STAR ====
  
-=== The Wrapper Script ===+[[https://www.ncbi.nlm.nih.gov/pubmed/23104886|STAR]] is a well-known mapping tool for RNA-Seq data.  
 + 
 +Module(s) can be found at: 
 + 
 +''​bio/​STAR/<​version>''​ 
 + 
 +You can find a wrapper to ease your workflow, [[software:​topical:​lifescience:#​standard_mappers|below]].
  
 ==== segemehl ==== ==== segemehl ====
Line 100: Line 68:
 </​WRAP>​ </​WRAP>​
  
 +The currently installed module is
 +
 +''​bio/​segemehl/​0.2.0-foss-2018a''​
 +
 +==== TopHat ====
 +
 +[[https://​ccb.jhu.edu/​software/​tophat/​index.shtml|TopHat]] is a fast splice junction mapper for RNA-Seq reads.
 +
 +Module can be found at:
 +
 +''​bio/​TopHat/<​version>''​
 +
 +
 +<WRAP center round info 90%>
 +This program is not yet incorporated into the wrapping module.
 +</​WRAP>​
 ==== yara ==== ==== yara ====
  
Line 115: Line 99:
  
 Most mapping tools adhere to this paradigm: They work on a reference (directory). They are, therefore, easily wrapped, such that the reference can be staged-in to a node-local directory (e.g. a [[:​ramdisk|ramdisk]]) in order to avoid random I/O (and consequently prolonged run times) on the parallel file system. Most mapping tools adhere to this paradigm: They work on a reference (directory). They are, therefore, easily wrapped, such that the reference can be staged-in to a node-local directory (e.g. a [[:​ramdisk|ramdisk]]) in order to avoid random I/O (and consequently prolonged run times) on the parallel file system.
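
The stage-in idea described above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the function name ''stage_in'', the default paths and the fallback to ''$TMPDIR'' are assumptions made for this example.

```shell
# Hypothetical defaults: a directory holding an indexed reference and a
# node-local base directory (e.g. a ramdisk mount point).
reference_dir=${1:-./reference}
local_base=${2:-${TMPDIR:-/tmp}}

stage_in() {
    # Copy the content of a reference directory to node-local storage
    # and print the path of the staged copy.
    local src=$1 base=$2 dest
    dest=$(mktemp -d "${base%/}/reference.XXXXXX")
    cp -r "$src"/. "$dest"/
    printf '%s\n' "$dest"
}

if [[ -d $reference_dir ]]; then
    staged=$(stage_in "$reference_dir" "$local_base")
    echo "reference staged to: $staged"
    # ... run the mapper against "$staged" here ...
    rm -rf "$staged"   # free the node-local storage afterwards
fi
```

The mapper then reads the index from the staged copy only, so the random I/O stays on the node and the parallel file system sees a single sequential copy.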
 +
 +
 +To scale the mapping task from one (or a few) samples to several samples in parallel, we provide a wrapper script, which is available as a module:
 +
 +''​bio/​parallel_MappingTools''​
 +
 +The code is under version management and hosted [[https://​gitlab.rlp.net/​hpc-jgu-lifescience/​seq-analysis|internally,​ here]].
 +
 +<WRAP center round important 90%>
 +The wrapper script submits a job itself: it is not intended to be run within an existing SLURM environment, but rather creates one.
 +</​WRAP>​
 +
 +Calling ''MapperWrapper -h'' displays a help message with all options the script provides. Likewise, ''MapperWrapper --credits'' displays credits and a version history.
 +
 +The script, after loading the module, can then be run like:
 +
 +<code bash>
 +$ MapperWrapper --executable=<​executable>​ [options] <​referencedir>​ <​inputdir>​
 +</​code>​
 +
 +<WRAP center round important 90%>
 +**Considerations**:​
 +
 +  * The wrapper recognizes FASTQ files with suffixes "''​*.gz''",​ "''​*.fastq''"​ or "''​*.fq''"​ and will always assume FASTQ files (compressed or uncompressed). [[software:​topical:​lifescience:#​yara|yara]] accepts bzipped files, too.
 +  * The number of processes (and therefore nodes) is limited to the number of samples.
 +  * The wrapper only works for paired end sequencing data, where the file tuples are designated with the following strings "''​_1''"​ and "''​_2''"​ or "''​_R1''"​ and "''​_R2''",​ respectively.
 +  * There are only a few options, because the wrapper internally calls ''bwa mem'' (or ''bwa aln'' in the single end case) and only sets up a few things to improve performance. Likewise, a switch for single and paired end data exists for the other mappers.
 +</​WRAP>​
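
To check beforehand which files would form a pair under the naming rules above, a sketch like the following may help. It is only an illustration of the convention (suffixes, ''_R1''/''_R2'' designators, skipping ''unpaired''), not the wrapper's implementation; the function name ''list_pairs'' is made up for this example, and the ''_1''/''_2'' case would work analogously.

```shell
list_pairs() {
    # Print "forward<TAB>reverse" for every complete _R1/_R2 pair in a directory.
    local dir=$1 fwd base mate
    for fwd in "$dir"/*_R1*; do
        [[ -f $fwd ]] || continue                 # glob did not match anything
        base=${fwd##*/}
        [[ $base == *unpaired* ]] && continue     # leftovers from read trimming
        case $base in
            *.gz|*.fastq|*.fq) ;;                 # recognized FASTQ suffixes
            *) continue ;;
        esac
        mate="$dir/${base/_R1/_R2}"
        if [[ -f $mate ]]; then
            printf '%s\t%s\n' "$fwd" "$mate"
        fi
    done
}

# usage: list_pairs /some/path/to/your/data
```

Forward files without a matching reverse file are silently skipped here; a real check might rather report them.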
 +
 +About Arguments:
 +
 +  * ''​referencedir''​ needs to be the (relative) path to a directory containing an indexed BWA reference
 +  * ''​inputdir''​ needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''​unpaired''​ are ignored; this is to support preprocessing with the [[software:​topical:​lifescience:​qc|quality check module]].
 +
 +The options:
 +  * ''​MapperWrapper''​ attempts to deduce your SLURM account. This may fail, in which case ''​-A,​ --account''​ needs to be supplied.
 +  * ''​--verbose,​--no-verbose'' ​ verbose execution (off by default)
 +  * ''--executable''  mandatory argument to designate the executable; possible values: ''bwa'', ''bowtie2'', ''yara''
 +  * ''-d,--dependency'', comma separated list of job IDs that must finish before the job starts
 +  * ''​-l,​--runlimit'',​ this defaults to 300 minutes.
 +  * ''-p,--partition'', the default is ''nodeshort'' or ''parallel'' on Mogon2; no smp partition should be chosen.
 +  * ''​-o,​--outdir''​ output directory path (default is the current working directory)
 +  * ''​--tag''​ optional tag/prefix for logfiles and directories
 +  * ''--groups'' provides a list of read group tags (the number of tags must equal the number of files)
 +  * ''​--single''​ (no arguments) to evaluate single end data
 +  * ''​--args''​ to supply additional flags, e. g. ''​--args="​-l 1024 -n 0.02"''​ for BWA - note the quotation marks, they are necessary.
 +  ​
 +Output:
 +
 +  * Per input tuple (paired sequencing data, only) a BAM file with the prefix of the input will be written. In the case of single end data, there will be one output per input, only.
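
A call combining some of the options listed above might look like the following sketch. The account name, directory names and tag are placeholders, and passing option values separated by a space is an assumption; ''MapperWrapper -h'' shows the authoritative syntax. The command is only assembled and printed here, so nothing is submitted by accident.

```shell
# Load the module first (on a login node, not inside a job):
#   module load bio/parallel_MappingTools

reference=reference_index   # placeholder: directory with the indexed reference
input=fastq                 # placeholder: directory with the FASTQ files

cmd=(MapperWrapper --executable=bwa -A myaccount -o mapped --tag run1 "$reference" "$input")

# Dry run: print the command line for inspection.
printf '%s ' "${cmd[@]}"; echo

# To actually submit, run the array instead:
#   "${cmd[@]}"
```

Keeping the command in an array preserves arguments that contain spaces, which matters once ''--args="..."'' is added.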
 +
 +=== Generating Read Group Tags ===
 +
 +Read group tags can be inserted with the ''--groups'' flag((From version 0.6 onward.)). The tags are supplied as a list on the command line. Example code to generate a tag list with consecutively numbered tags:
 +
 +<code bash>
 +# define the input directory appropriately in a master script:
 +inputdir=/some/path/to/your/data # assuming '_R1' defines the forward reads in a paired end scenario
 +
 +# a template - may deviate from project to project
 +template="@RG\tID:+ID+\tLB:unknown_lb\tPL:illumina\tSM:sample+ID+"
 +# the tag list to be generated
 +tags=""
 +# number of samples - this snippet could be integrated in a script
 +nsamples=$(find "$inputdir" -name '*_R1*.fastq' | grep -v unpaired | wc -l)
 +# now the actual generation:
 +for ((i=1; i <= nsamples; i++)); do
 +  tags="$tags $(sed -e "s/+ID+/$i/g" <<< "$template")"
 +done
 +</​code>​
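
A variant of the snippet above that avoids the ''sed'' call and keeps each tag as a separate array element could look like this. It is a sketch under assumptions: ''nsamples'' is hard-coded for illustration (in practice it would be counted as shown above), and how the resulting list has to be handed to ''--groups'' should be checked with ''MapperWrapper -h''.

```shell
# Same template as above; single quotes keep the "\t" sequences literal.
template='@RG\tID:+ID+\tLB:unknown_lb\tPL:illumina\tSM:sample+ID+'

nsamples=3   # illustrative value only
tags=()
for ((i = 1; i <= nsamples; i++)); do
    # bash parameter expansion replaces every "+ID+" with the running number
    tags+=("${template//+ID+/$i}")
done

# one tag per line, for inspection
printf '%s\n' "${tags[@]}"
```

With an array, ''${#tags[@]}'' gives the number of tags directly, which makes it easy to compare against the number of input files before submitting.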
  
  
Line 134: Line 188:
  
 <WRAP center round important 90%> <WRAP center round important 90%>
-**Limitations**: +**Considerations**: 
-  * See the parallel_BWA wrapper +  * See the [[software:​topical:​lifescience:​ngs_read_mapping_tools#​standard_mappers|"​standard"​ Mappers]] 
-  * Also: The script will only use the ''​m2_gpu''​ partition and therefore needs an account with the ''​m2_''​ prefix.+  * Also: The script will only use the ''​m2_gpu''​ partition and therefore needs an account with the ''​m2_''​ prefix((This is because development to support the wild "​zoo"​ of hardware and partition setting is hardly worth the effort for this software, as tests show that standard bwa (properly mapped) outperforms the gpu version.)).
 </​WRAP>​ </​WRAP>​
  
Line 142: Line 196:
 About Arguments: About Arguments:
   * ''​referencedir''​ needs to be the (relative) path to a directory containing an indexed BWA reference. No symbolic links are allowed.   * ''​referencedir''​ needs to be the (relative) path to a directory containing an indexed BWA reference. No symbolic links are allowed.
-  * ''​inputdir''​ needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''​unpaired''​ are ignored; this is to support preprocessing with the [[software:​topical:​lifescience:​trimmomatic|trimmomatic ​module]].+  * ''​inputdir''​ needs to be a (relative) path to a directory containing all inputs. Subdirectories and files containing the string ''​unpaired''​ are ignored; this is to support preprocessing with the [[software:​topical:​lifescience:​qc|quality check module]].
  
  
Line 156: Line 210:
  
 ===== Comparison Benchmarks ===== ===== Comparison Benchmarks =====
 +
 +
 +Notwithstanding our own [[software:topical:lifescience:ngs_read_mapping_tools#Comparison_Benchmarks|benchmarks]], a first impression can be found in [[http://www.ecseq.com/support/benchmark.html|the same blog]].
  
 <WRAP center round todo 65%> <WRAP center round todo 65%>
 This part needs some more time to be finished .... This part needs some more time to be finished ....
 </​WRAP>​ </​WRAP>​
software/topical/lifescience/ngs_read_mapping_tools.1544697303.txt.gz · Last modified: 2018/12/13 11:35 by meesters