Sorting, Deduplication and Indexing of SAM/BAM Output

The canonical tool for handling SAM/BAM format is SAMtools.

Modules are available on both clusters as: bio/SAMtools

A wrapper to ease your workflow is described below.
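For a single sample, the three steps can also be run by hand. The following is a minimal sketch, not the wrapper's actual implementation: it assumes a samtools release that provides the markdup subcommand, and the function name sort_dedup_index as well as the intermediate file names are made up for illustration.

```shell
# Sketch: sort, deduplicate and index one BAM file with samtools.
# Usage: sort_dedup_index <input.bam> <outdir> [threads]
# The function body runs in a subshell, so its variables stay local.
sort_dedup_index() (
    in="$1"
    outdir="$2"
    threads="${3:-1}"
    base="$(basename "${in%.*}")"   # file name without directory and extension

    # markdup relies on mate tags, which fixmate -m adds to name-sorted input
    samtools sort -n -@ "$threads" -o "$outdir/${base}_namesorted.bam" "$in" &&
    samtools fixmate -m "$outdir/${base}_namesorted.bam" "$outdir/${base}_fixmate.bam" &&
    # coordinate sort, then remove (-r) the duplicates and index the result
    samtools sort -@ "$threads" -o "$outdir/${base}_sorted.bam" "$outdir/${base}_fixmate.bam" &&
    samtools markdup -r "$outdir/${base}_sorted.bam" "$outdir/${base}_sorted_dedup.bam" &&
    samtools index "$outdir/${base}_sorted_dedup.bam"
)
```

After loading the bio/SAMtools module, a call could look like: sort_dedup_index alignments/sample1.bam results 4 (both paths are hypothetical).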

Sambamba aims to be a fast (almost drop-in) alternative to SAMtools.

Modules are available on both clusters as: bio/sambamba

A wrapper to ease your workflow is described below.

During testing for the wrapper module, sambamba was found to be unreliable and uninformative when crashing. For example, it reported that the stream was broken, although the file system was fine and other tools neither experienced nor reported this issue.
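The equivalent manual steps with sambamba might look as follows. This is a sketch under assumptions: it expects BAM input, the function name sambamba_sdi is hypothetical, and it does not necessarily reflect how the wrapper calls sambamba.

```shell
# Sketch: sort, deduplicate and index one BAM file with sambamba.
# Usage: sambamba_sdi <input.bam> <outdir> [threads]
# The function body runs in a subshell, so its variables stay local.
sambamba_sdi() (
    in="$1"
    outdir="$2"
    threads="${3:-1}"
    base="$(basename "${in%.*}")"   # file name without directory and extension

    # coordinate sort, remove (-r) duplicates, then index the result
    sambamba sort -t "$threads" -o "$outdir/${base}_sorted.bam" "$in" &&
    sambamba markdup -r -t "$threads" "$outdir/${base}_sorted.bam" "$outdir/${base}_sorted_dedup.bam" &&
    sambamba index -t "$threads" "$outdir/${base}_sorted_dedup.bam"
)
```

Because the steps are chained with &&, a crash in one step stops the chain, which makes it easier to see where sambamba failed.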

Picard tools are yet another alternative, offering additional features that the wrapper script (see below) does not cover.

Modules are available on both clusters as: bio/picard

To scale the task from one (or a few) samples to many samples that are sorted, deduplicated and indexed in parallel, we provide a wrapper script, which is available as a module:

bio/parallel_SDITools

The code is under version management and hosted internally, here.

The wrapper script submits a job itself: it is not intended to be run inside a SLURM job, but rather creates one.
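A quick way to check from where you are calling the wrapper is to look at the SLURM_JOB_ID environment variable, which SLURM sets inside running jobs. The helper name in_slurm_job below is hypothetical; this is a small sketch, not part of the wrapper.

```shell
# Returns success (0) when called from inside a running SLURM job.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "Inside SLURM job ${SLURM_JOB_ID}: call SDIWrapper from a login shell instead." >&2
else
    echo "Not inside a SLURM job: SDIWrapper can submit its own job from here."
fi
```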

Calling SDIWrapper -h displays a help message with all the options the script provides. Likewise, the call SDIWrapper --credits displays credits and a version history.

After loading the module, the script can be run like this:

$ SDIWrapper [options] <bamdir>

Limitations:

  • wrapping sambamba is supported, yet crashes may occur.

About Arguments:

  • bamdir is the path (it may be relative) to the top-level directory containing the BAM (or SAM) files.

The options:

  • --executable, mandatory argument to designate the executable
    1. possible arguments: samtools, sambamba, picard
    2. check is case insensitive
    3. defaults to 'samtools'
  • -l,--runlimit, the run time limit; defaults to 300 minutes
  • -p,--partition, the default is nodeshort, or parallel on Mogon2
  • -A,--account, SLURM account
    1. default is the last submit account
    2. an error is triggered if none specified nor can be deduced
  • -t,--threads, number of threads the executable should use (defaults are application dependent)
  • -a,--args, arguments otherwise not set by the wrapper
    1. the defaults of the chosen executable apply for unset arguments
    2. will supersede the defaults, e.g. 'LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36' for trimmomatic
  • -d,--dependency, comma-separated list of job IDs that the job will wait for before it starts
  • -o,--outdir, output directory path (default is the current working directory)
  • --constraint, only on Mogon2, defaults to 'broadwell'
  • --tag, a job tag (default is deduced from the naming scheme)
  • --credits, shows credits and version history
  • --version, shows the version number
  • -h,--help, prints help

Specific processing arguments:

  • --dnd, do not deduplicate (default is to deduplicate)
  • --dni, do not index (default is to index)
  • --ks, keep the sorted bam file (default is to discard this file)
  • --dkd, do NOT keep the deduplicated bam file (default is to keep this file)
  • --SE, will supply '-s' (for single end data) for samtools index (only when executable is samtools)
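As an illustration of how the options combine, the sketch below assembles one call and prints it for review instead of running it. The account name and both directory names are hypothetical, and the long options are assumed to be written with two hyphens.

```shell
# Run on a login node after: module load bio/parallel_SDITools
account="m2_example"      # hypothetical SLURM account
bamdir="alignments"       # hypothetical directory containing the BAM/SAM files
outdir="sdi_results"      # hypothetical output directory

# samtools as executable, 120 minutes run limit, 4 threads,
# and keep the intermediate sorted BAM file (--ks)
set -- SDIWrapper --executable samtools -A "$account" -l 120 -t 4 --ks -o "$outdir" "$bamdir"

# Print the assembled command line for review
echo "$@"
```

Once the printed line looks right, the same call can be executed with "$@" in place of the echo line.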

The output naming scheme:

Within the specified (or default) output directory, the resulting BAM (and index) files will be stored with names according to this scheme:

  • _sorted_index.bai - ending for index files
  • _sorted_dedup.bam - for sorted and deduplicated files
  • _sorted.bam - if only sorted and not deduplicated
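The naming scheme can be used to check whether a run produced everything you expect. The sketch below assumes that the endings replace the extension of the input file name, which is not stated explicitly above, and that the default settings were used (deduplicate, index, discard the sorted file). The helper name expected_outputs is hypothetical.

```shell
# Print the output files expected for one input file with default settings.
# Usage: expected_outputs <input.bam> [outdir]
# The function body runs in a subshell, so its variables stay local.
expected_outputs() (
    base="$(basename "${1%.*}")"   # file name without directory and extension
    outdir="${2:-.}"               # default: current working directory
    printf '%s\n' \
        "$outdir/${base}_sorted_dedup.bam" \
        "$outdir/${base}_sorted_index.bai"
)

# Example check (paths are hypothetical):
# expected_outputs alignments/s1.bam sdi_results |
#     while read -r f; do [ -e "$f" ] || echo "missing: $f"; done
```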

Note that the different applications work with different default compression levels. Hence, the output size may differ for identical data, should anyone care to compare the tools.

  • Last modified: 2019/02/13 12:28
  • by meesters