Using a Workflow Management System
Functionality
We provide a modulefile, tools/mflow, which is a thin wrapper around the snakemake workflow system. While it is possible to use snakemake directly after loading the module, mflow provides a curated cluster configuration for each of its supported workflows.
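In a typical environment-modules setup, the wrapper becomes available once the module has been loaded:
$ module load tools/mflow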
Using mflow
To obtain usage help, you can run
$ mflow -h
which will show every option supported by mflow.
Selecting Workflows and Obtaining Information about Them
To see which workflows are available run:
# will list all available workflows - sorted by topic
$ mflow --list-workflows
Each supported workflow provides an annotation file with supplementary information; its name is shown in the second column of the '--list-workflows' output. You can look it up with
$ mflow --show-annotation <annotation>
Most workflows require you to manually edit and provide a configuration file. This file contains all the information about the input(s) and the software modules to be used. While snakemake usually relies on Conda (mostly: Bioconda) packages, in an HPC environment it is better to use environment modules for performance reasons.
We provide sample configurations, which can be obtained with
$ mflow --show-configuration <workflow name>
This will print the configuration to the terminal and write a file <workflow name>.yaml for you to edit according to your input.
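As an illustration only - the actual keys and layout are defined by the respective workflow, so always start from the generated sample file - an edited configuration might look roughly like this:
$ cat <workflow name>.yaml
# hypothetical sketch, key names invented for illustration
samples:                          # the input(s) for this run
  sample_A: /path/to/sample_A.fastq.gz
  sample_B: /path/to/sample_B.fastq.gz
modules:                          # environment modules used instead of Conda packages
  aligner: <module name/version>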
Running Workflows - Quick Start
As every curated workflow comes with a cluster-specific configuration, most workflows simply require you to
- provide the SLURM account of a project,
- edit and select a workflow-specific configuration, and
- select the desired workflow itself.
A typical call looks like
$ mflow -A <account> -w <workflow> --configfile <workflow specific configuration file>
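Filled in, such a call could look like this (using the provided ProteoTrans workflow and its edited configuration file as an example):
$ mflow -A <account> -w ProteoTrans --configfile ProteoTrans.yaml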
Some workflows additionally require a rule-specific cluster configuration. This can be provided using --cluster-config <path to workflow specific configuration file>
Reproducibility
It may seem cumbersome at first - manually editing a configuration file! But it not only serves the purpose of providing and selecting all necessary input. The file also serves as a record for you: Which software versions have been used? What exactly was the selected input? Etc.
Coping with Prolonged Runtimes
As soon as you log out or the terminal running the workflow loses its connection, the workflow is aborted. Already submitted jobs are not affected; they keep running. However, the workflow will not be aware of them and might re-submit jobs when triggered again.
It is also possible to start mflow in nohup mode:
$ mflow --nohup ...
If you invoke mflow in this mode, there will be no further output; snakemake will run in the background. Use this only for well-established workflows, not while designing a workflow.
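For instance, a complete background call for an already working configuration might look like
$ mflow --nohup -A <account> -w <workflow> --configfile <workflow specific configuration file>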
Best Practices
Where to work & Reporting Errors
Consider working in your home directory: all temporary files are deleted automatically after 10 days. Working in your home directory therefore prevents cluttering your group's project directory.
Please distinguish between cluster-related issues/errors and workflow-related errors. To sort out issues and reach a solution quickly:
- Mail all cluster-related issues to our HPC ticket system or approach us on our Mattermost channel.
- If there is an error in a workflow, please try to summarize it comprehensively and open an issue on the project page by clicking 'New Issue'.
- Indicate the error message, provide context, and attach the input configuration and the error log file.
- Please always use the current mflow version; otherwise we may not be able to provide support.
Testing Workflows and Other Snakemake Instructions
To pass parameters to snakemake itself, use
$ mflow ... -- <list of snakemake parameters>
A useful application is running
$ mflow ... -- --dry-run
to test a given workflow without executing it. Note that for a dry run, parameters such as the account, configuration and workflow still need to be present.
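A complete dry-run invocation could therefore look like
$ mflow -A <account> -w <workflow> --configfile <workflow specific configuration file> -- --dry-run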
Handling Errors
If a workflow or a workflow job is aborted, incomplete files can result. A rerun can be triggered with
$ mflow ... -- --rerun-incomplete
Job Output
Each workflow writes its (scientific) output to the locations specified in the configuration file (see the sketch below this list). Curated workflows differentiate between
- cached output, e.g. read mapping indices, downloaded reference / input files, etc. This between-workflow caching saves time, and curated workflows ensure it by their layout.
- temporary output files - intermediate files that are deleted once they are no longer needed as input. These files can easily be re-generated and should be stored temporarily on the scratch file system.
- final results, as specified in the workflow-specific configuration.
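How these three output classes map onto the file systems is controlled by the configuration file; a purely hypothetical excerpt (key names invented for illustration) might look like
$ cat <workflow name>.yaml
# hypothetical excerpt - the actual keys depend on the workflow
cache_dir: /path/to/shared/cache        # re-usable between runs, e.g. mapping indices
tmp_dir: /path/to/scratch/tmp           # intermediate files on the scratch file system
results_dir: /path/to/project/results   # final results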
Provided Workflows
| Topic | Workflow Name | Core Applications |
|---|---|---|
| Structure Based Ligand Screening | StructureBasedScreening | OpenBabel, Modeller, VinaLC |
| ProteoTranscriptomics | ProteoTrans | Blast, MaxQuant, Trinity |
Call for Collaboration
mflow development takes place on the RLP GitLab server. All contributions are welcome. To contribute, you have a number of options:
- Fork the project, edit the code and create a merge request.
- This applies to contributing new workflows, too.
- Get in touch with us to start a new co-supervised Bachelor's or Master's thesis together.
- Contribute to the documentation - here in the wiki or by writing issue reports.
Reporting Issues
Any workflow-related issues or issues of mflow itself should be reported on its project page for better overview and tracking.
HPC-related issues can be reported via the usual channels: our Mattermost channel or mail to the HPC group at hpc@uni-mainz.de.