DMTN-025: A survey of workflow management systems

  • Mikolaj Kowalik,
  • Hsin-Fang Chiang,
  • Greg Daues and
  • Rob Kooper

Latest Revision: 2018-06-11

Scope of document

This document summarizes the results of a survey of viable candidates for a workflow management system (WMS) to be used by the LSST Batch Production Services [CONOPS]. A WMS helps automate the orchestration and execution of large batches of tasks on available, possibly distributed, computational resources.

Introduction

A common pattern in scientific computing involves the execution of many computational or data manipulation tasks. Those tasks are usually coupled, i.e., data produced by one task are consumed by one or more other tasks, so executing them often requires non-trivial coordination (orchestration) to satisfy their data dependencies. As the number of tasks and/or the data sets they operate on scale up, it is desirable to divide the workload among all available, nowadays usually distributed, computing resources. This invariably introduces additional levels of complexity related to load balancing, data storage and transfer, monitoring the tasks' execution, and handling their failures. Attempts to automate the aforementioned aspects of the orchestration process have led in recent years to a proliferation of frameworks and libraries commonly known as workflow management systems. This site alone lists over 70 existing, open-source WMSs and is far from complete.
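To make the orchestration problem concrete, the following toy sketch (plain Python; the task names are invented) runs tasks in an order that satisfies their declared data dependencies. A real workflow management system layers distribution, load balancing, retries, and monitoring on top of this core idea.

# Toy dependency-driven task execution (hypothetical task names).
dependencies = {
    "ingest":    [],                        # task -> tasks whose outputs it consumes
    "calibrate": ["ingest"],
    "coadd":     ["calibrate"],
    "measure":   ["calibrate", "coadd"],
}

done = set()
while len(done) < len(dependencies):
    # Run every task whose inputs are all available.
    ready = [t for t, deps in dependencies.items()
             if t not in done and all(d in done for d in deps)]
    if not ready:
        raise RuntimeError("cycle in workflow definition")
    for task in ready:
        print("running", task)              # a real WMS dispatches this to a worker
        done.add(task)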

At the time of writing this report (Aug 2016), it is hard to point to any clear “winners” or even dominant standards in the field. Different projects focus on different aspects of orchestration, usually dictated by the immediate needs of their authors, and/or were designed to work within certain “ecosystems”. This casts some doubt on their applicability in other domains and on their interoperability with technologies other than those intended. On top of that, the maturity of these projects, their support from developers and the community, and their available funding vary significantly, making the selection of a WMS satisfying the needs of the LSST project regarding Batch Production Services far from straightforward. To be able to make at least an educated guess as to which WMS would work best for the LSST project, we decided to take a closer look at 8 of the most “serious”-looking WMSs and survey them in more detail against a set of selected criteria we deem important considering the project's needs.

Finally, it is worth emphasizing that this survey is preliminary in character. Its goal is not to select the future workflow management system for the LSST project but to select 3 or 4 close contenders for further, more elaborate testing with the LSST stack in an HPC environment.

Methodology of the survey

Our goal is to pick a WMS that will work in the batch system of LSST. To narrow down the list of potential candidates to a smaller subset, we looked at past and upcoming XSEDE conferences and workshops [XSEDE]. The main reason to look at the XSEDE workshops is that all of the workflow systems discussed at these workshops are capable of running on XSEDE resources and are thus ready for large-scale executions (Makeflow, Pegasus, RADICAL-Pilot, and Swift). Besides the systems discussed there, we considered some of the more recent workflow systems that have been open-sourced by large companies (Airflow, CloudSlang, and Pinball). Finally, we looked at other fields, such as high-energy physics, and picked one of the prominent workflow systems from there (PanDA). Each of the candidates was then reviewed against a set of criteria that may be divided into three main categories, described in more detail below.

Technical merits

Before we compared the different workflow systems, we created a set of criteria to evaluate them against. These criteria focus on making sure that the chosen workflow system is established and well-proven, since it should last for the duration of the project, and that it has a community we can ask questions of, get help from when solving problems, and contribute back to.

Flexibility in workflow generation

Workflow management systems use a wide variety of methods to create workflows: some use dedicated GUIs, others follow a “configuration as code” approach in which the workflow description is, for example, a script written in Python. These methods offer different degrees of freedom and flexibility to the authors of a workflow, and we therefore consider this aspect an important characteristic of a workflow management system.

Interoperability

As we already mentioned in the Introduction, some of the existing workflow management systems were designed to work within certain “ecosystems”. As the LSST project's execution environments may differ from the one a WMS was designed for, it is essential to know which technologies a given system can currently cooperate with, and how easily it can be extended to support others.

Scalability

As workflows vary in size and complexity, it is important that a workflow management system is able to handle both small and large workflows (in terms of the number of tasks) equally well, e.g., that its memory requirements grow sufficiently slowly with an increasing number of tasks in a workflow.

Performance

Though scheduling tasks and tracking their provenance is not a particularly CPU-intensive process, we wanted to be sure that under a large load a workflow management system will not become a bottleneck and will utilize available resources efficiently.

Reliability

With a large number of tasks running on distributed computational resources, some failures are inevitable. Thus any decent workflow management system should support automatic retries, i.e., rerunning failing tasks, to deal with transient errors (e.g., a temporary network outage). It should also track the provenance of tasks to allow a workflow to be restarted in case of more permanent execution errors (e.g., runtime errors due to flawed input).

Monitoring tools

Workflow management systems provide logs which can be used to monitor task execution. However, with numerous large and complex workflows, going through logs is not a particularly useful method of monitoring workflow execution in real time. That is why in our survey we decided to take into account the existence of any extra monitoring tools, such as dashboards.

Available support

Considering the scale, complexity, and longevity of the LSST project, we are convinced that purely technical merits cannot be the only ones to be taken into account during the evaluation process. After all, every piece of software has bugs which need to be fixed, configuration issues which may not be covered in documentation, and so on. Thus, we also decided to consider things like:

  • communication channels allowing one to track the project's current development and to report issues;
  • the number of active developers and their responsiveness in addressing issues reported by users;
  • the number of use cases, the size of the project's community, and its activity on the project's available channels;
  • and last but not least, its funding.

Overall “user experience”

Finally, we decided to estimate the overall overhead associated with getting started with a given workflow management system by installing (almost) every one of the surveyed frameworks on our workstations. That step allowed us to address the following questions:

  • How thorough and complete is the project's documentation? Is a user able to get all necessary information from it, or does he or she have to fall back on the “Use the Force, Read the Source!” principle on a regular basis?
  • How complex is the framework's installation and configuration process? Does it have a relatively small number of dependencies or, on the contrary, does it require a multitude of external libraries to have all of its advertised features enabled?
  • What is the learning curve associated with creating and executing an example workflow?

Results

We evaluated each project listed earlier, based on the criteria described in the previous section, using available online resources, i.e., white papers, documentation, described use cases, and users' opinions. Our findings are best summarized in Table 1 (entries marked with a question mark denote pieces of information we were not able to determine at the current level of scrutiny). For the sake of completeness, we also provide, further below, more in-depth reviews of each investigated workflow management system for the interested reader.

Table 1 Comparison of surveyed workflow management systems. Values are listed in the column order: Airflow | CloudSlang | Makeflow | Pinball | PanDA | Pegasus | RADICAL-Pilot | Swift.

Flexibility in workflow generation: high | high | low | not tested | medium | medium | high | high
Interoperability: Hive, Mesos | CoreOS, Docker, Heroku | Condor, SLURM, PBS, SGE, Torque | ? | Condor | Condor, XSEDE, SGE, EC2 | Condor, SLURM, PBS, SGE | Condor, SLURM, PBS, SGE
Performance: high | high | high | high | high | high | high | high
Reliability (retries / restarts / replications): yes / yes / ? | no / yes / no | yes / yes / no | yes / ? / ? | ? / ? / ? | yes / yes / yes | yes / no / no | yes / yes / yes
Logging: yes | yes | yes | yes | yes | yes | yes | yes
Monitoring tools: GUI | no | GUI | GUI | GUI | GUI | no | no
Scalability: good | good | good | good | good | good | good | good
Load balancing: no | ? | yes | ? | yes | yes | ? | yes
Prioritization: yes | ? | ? | ? | yes | yes | yes | yes
Provenance tracking: yes | ? | ? | ? | yes | yes | yes | yes
“Pilot job” functionality: yes | ? | yes | yes | yes | yes | yes | yes
Data transfer: yes [1] | ? | yes | ? | yes | yes | yes | yes
Data caching: no | ? | yes | ? | ? | ? | ? | no
Language of implementation: Python | Java | C | Python | Python, JavaScript | Python, Perl, shell, Java | Python | Java
Dependencies: MySQL, Celery, message broker | none | none | MySQL | Condor | Condor, Ant, OpenSSL | MongoDB | Globus
OS requirements: Linux / OSX / Win | Linux / OSX / Win | Linux / OSX / Win | Linux / OSX / Win | Linux | Linux | Linux / OSX / Win | Linux / OSX / Win
Licence type: Apache v2 | Apache v2 | GNU GPL v2 | Apache v2 | Apache v2 | Apache v2 | MIT | Apache v2
“DOB”: 2015 | 2014 | 2009 | 2015 | 2005 | 2001 | 2014 | 2007
Active development: yes | yes | yes | yes | no | yes | yes | yes
Latest release: May 2016 | July 2016 | May 2016 | ? | 2015 | Apr. 2016 | May 2016 | Aug. 2015
Documentation quality: good | very good | good | poor | good | good | good | very good
Installation complexity: medium | easy | easy | not tested | not tested | easy | easy | easy
Learning difficulty: low | low | low | not tested | high | medium | low | high
Popularity: high | high | low | low | low [2] | low [3] | low | medium

Footnotes

[1] Staging data in and out must be included explicitly in the DAG representing a workflow, with the help of so-called Transfer operators.
[2] The number of reported use cases is small, but some of those projects are huge!
[3] Same as above.

Summary

LSST Batch Production Services [CONOPS] are about executing “[…] campaigns on computing resources to produce the desired LSST data products […]”, where campaigns are defined as sets of pipelines (ordered ensembles of computational steps), the inputs they are run against, and the methods of handling their outputs. As campaigns can vary in size and complexity, we will face a non-trivial problem of orchestrating their execution on potentially distributed computational resources while satisfying the data dependencies. As this pattern is quite common in many scientific and business applications, there exist many frameworks, called “workflow management systems”, whose sole purpose is to automate such processes.

To select an optimal workflow management system for the LSST Batch Production Services, we made a detailed, multi-aspect survey of a few available workflow management systems to estimate their usefulness in implementing the objectives of LSST batch operations [CONOPS]. Based on our findings, we have selected three of them (Airflow, Pegasus, and PanDA) for further tests, in which the chosen candidates will be used to orchestrate the execution of a few specific LSST pipelines (yet to be determined) in NCSA's HPC environment.

Follow-on to the Summary for Status Oct 2019

Following this workflow management system survey, the first BPS prototype workflows were implemented in Pegasus for runtime execution. These workflows were exercised against some small test datasets (e.g., the testdata_ci_hsc package). They were executed on the 'Verification Cluster' of servers run by NCSA for LSST DM by first submitting HTCondor glide-in jobs to the Slurm scheduler ('pilot jobs' that effectively transform the Slurm compute nodes into a Condor pool). The BPS prototype workflows were also the basis for small-scale tests in the LSST DM AWS POC work.

Within its workflow execution, Pegasus utilizes HTCondor DAGMan; in some sense DAGMan is the inner engine driving the execution of the directed acyclic graphs (DAGs), with Pegasus serving as a layer of management/organization, monitoring, and additional features/plugins wrapping HTCondor DAGMan and the Condor layer at which jobs run.

Following these small-scale tests, the BPS workflows are now targeting support for large-scale production and operations. The BPS effort is taking this opportunity to undergo a recasting, with the plan to construct and wrap HTCondor DAGMan workflows directly. These DAGs will submit and manage HTCondor jobs on pools of resources. This offers a simplification of the development, reducing the number of layers/tools as the BPS is integrated with the Science DAGs, Quantum Graphs, etc., of the Gen3 system. The DM team will also utilize enhanced direct collaboration with the HTCondor team (e.g., in efforts such as the DM AWS POC) to assist with the integration with HTCondor DAGMan workflows. The HTCondor DAGMan based BPS workflows will be tested on the nascent and increasing pool of HTCondor batch system nodes at NCSA.

In addition to the DM work of prototyping BPS workflows against HTCondor clusters running at NCSA, the DM AWS POC effort is investigating the execution of workflows on HTCondor pools constructed via the condor_annex mechanism within the AWS cloud. Testing has demonstrated that full HTCondor pools, central manager and worker nodes, can be launched and prototype BPS workflows executed for the first case of testdata_ci_hsc. The condor_annex approach continues to evolve and improve in its usability. In the past (at the HTCondor 8.6 level) it was advisable to use the base AMIs that the HTCondor team provided and made publicly available. In current work at the HTCondor 8.9.3 level, we see that worker-node AMIs suitable for condor_annex can be readily created by installing a few condor packages, and that customized configuration can be deployed to worker nodes at the time of condor_annex invocation. The DM AWS POC effort will continue tests at increasingly larger scales. In local testing we can demonstrate the annexing of AWS worker nodes to an on-premises central manager at NCSA, though full security and trust between such AWS workers and LSST DM HTCondor resources at NCSA requires further investigation.

Addendum

Currently a large effort is underway to change how the middleware operates. This new version is the Gen3 middleware and consists of a rewrite of the Butler as well as the addition of SuperTasks and pre-flight. The pre-flight step will use the Gen3 Butler, as well as the SuperTask, to create a quantum graph (a workflow-like graph) that can be executed to run the actual pipelines. This work is in the implementation phase; depending on the results of and changes to this Gen3 middleware, we will re-evaluate some of the workflow systems to see which one fits best with the new middleware and the project requirements.

During this time we have briefly looked at the Parsl software as another workflow system option, and we have successfully prototyped the use of the DESDM framework in LSST operations, additionally porting it to Python 3 so that it is ready as another solution.

DESDM Framework

In the past year we have looked at the DESDM framework as one of the options to execute the pipelines in an operations environment. At NCSA we have experience running this framework as part of the DES project, where it has been in production for the past 5 years for nightly processing as well as for producing data releases. NCSA has trained operators to run the software, analyze the results, and retry selected sections of the workflow. The system is ready to use and can execute tasks in parallel on an HTCondor cluster. Code is available for glide-ins, allowing an existing cluster (such as lsst-dev) to be configured to execute tasks from the DESDM framework. The framework collects statistics, enabling us to do performance testing, as well as provenance, allowing us to find out exactly how each generated data file was produced, with what inputs, and by whom. In the past months we have spent time converting the existing code base from Python 2 to Python 3 to fit into the LSST software stack. We also wrote LSST-specific plugins to enable running the Gen2 LSST pipelines. We have done tests running the LSST software stack on the RC dataset with both the Python 2 and Python 3 versions, to check whether the DESDM framework is ready to run the existing LSST pipelines.

Parsl

Parsl (Parallel Scripting Library) is a scripting library written in Python. The current version is 0.5.1. Parsl has been under development for the past 2 years as a collaboration between the University of Chicago, Argonne National Laboratory, and the University of Illinois. The code is released under the Apache 2.0 License and is available on GitHub. Parsl is a workflow system that can manage the data movement and execution of tasks on different hardware such as clouds, clusters, and other resources. Tasks can be executed in parallel and will only be executed when the required data are ready. The tasks are either bash scripts or Python functions; they take arguments that are supplied during the definition of the workflow, such as parameters or outputs from previously executed tasks. The workflow system uses futures to set up the tasks to be executed, and will only execute those tasks that are needed for the final result. If a task requires the results from another task, it will wait for that task to finish. Files in Parsl are abstracted, allowing for the movement and placement of files anywhere in the filesystem: Parsl will move a file to the required compute resource, and the task will use the special Parsl file object to read and write the file from disk, freeing it from knowing where the file is actually stored. Tasks written in Python have some limitations: all modules must be explicitly imported in the code, and global variables do not exist. Workflows in Parsl are defined programmatically in Python, using Parsl-specific functions; tasks to be executed in this workflow system are marked with the @App decorator. Parsl is written from scratch and is influenced by the Swift language, which we looked at in more detail (see Appendix H). We have not yet looked into Parsl in detail to see how it would fit into the LSST project.
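For illustration, here is a minimal script in the style of the Parsl quickstart examples from the 0.x releases (the API has been changing between versions, so treat this as a sketch rather than definitive usage): two @App-decorated functions are composed through futures, and the final task runs only once its inputs are ready.

import parsl
from parsl import App, DataFlowKernel, ThreadPoolExecutor

# Run tasks on local threads; cluster or cloud executors can be swapped in.
workers = ThreadPoolExecutor(max_workers=4)
dfk = DataFlowKernel(executors=[workers])

@App('python', dfk)
def double(x):
    return 2 * x

@App('python', dfk)
def add(a, b):
    return a + b

# Passing futures as arguments expresses the data dependencies;
# add() executes only after both double() calls have completed.
total = add(double(2), double(3))
print(total.result())  # 10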

Appendix A: Airflow

Overview

Airflow is a platform to programmatically author, schedule, and monitor workflows, developed and used internally by Airbnb. It is written in Python and released under the Apache License, v2.0. The project was started in the fall of 2014 by Maxime Beauchemin at Airbnb, and it was officially brought under the Airbnb GitHub and announced in the spring of 2015. The project joined the Apache Software Foundation's incubation program in the winter of 2016.

Features

Flexibility

Airflow's workflows are specified as Python code, which offers a great deal of flexibility when designing workflows. On top of that, Airflow supports Jinja templating. Thus workflows can be generated programmatically. Additionally, it has a simple plugin manager that can integrate external features into its core if required.
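As a sketch of this “configuration as code” style (based on the Airflow 1.x API current at the time of writing; the task names are invented), a DAG and its dependencies can be declared, and even generated in a loop, in a few lines of Python:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("toy_pipeline", start_date=datetime(2016, 8, 1), schedule_interval=None)

ingest = BashOperator(task_id="ingest", bash_command="echo ingest", dag=dag)

# Tasks can be generated programmatically, e.g., one per data chunk.
for i in range(3):
    calibrate = BashOperator(task_id="calibrate_%d" % i,
                             bash_command="echo calibrate %d" % i,
                             dag=dag)
    ingest.set_downstream(calibrate)  # declare the dependency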

Monitoring

Airflow comes with a web dashboard allowing a user to visualize, manage, and monitor the execution of available workflows.

Performance

We were not able to find any solid performance metrics for Airflow.

Interoperability

Airflow comes fully loaded with ways to interact with commonly used systems like Hive, Presto, MySQL, HDFS, Postgres, and S3. The sources we came across do not mention more HPC-oriented technologies/protocols, though, in theory, Airflow may be easily extended due to its modular architecture.

Reliability

Airflow supports retrying failed tasks. However, it seems that workflow restarts are not directly supported, although this blogpost describes how to use subdags to emulate them.

Scalability

The project's website claims that, due to its modularity, Airflow can orchestrate an arbitrary number of tasks and workers. In their own words:

Airflow is ready to scale to infinity.

Sadly enough, they do not provide any solid data to support this bold claim. One example we were able to find mentions running 20,000 tasks per day on a 32-core machine, which is not particularly impressive from an HPC perspective.

Support

Currently the project is being developed and maintained by a group of 9 programmers.

There are two places where an Airflow user can seek assistance: a mailing list and a Gitter channel.

The community gathered around the mailing list seems to be rather vibrant: since it was established (Apr. 2016) there have been approximately 700 posts. Similarly with the Gitter channel; there are quite a few Airflow users who may be asked for help if needed.

The project's GitHub page lists 39 companies that have declared they use Airflow as their workflow management system.

The project is currently undergoing incubation at the Apache Software Foundation. Though this means it is not yet fully endorsed by the ASF, it suggests the project may have a sustainable source of funding in the near future.

User experience

Documentation

Airflow comes with a rather well-written tutorial explaining how to create, validate, and execute an example workflow. However, the user guide and API reference are somewhat lacking: information is provided without a clear order or the level of detail one would expect from a user guide, and some more advanced aspects (e.g., writing hooks) are not covered at all. Probably the worst is the chapter regarding Airflow's rather feature-rich web UI; it is basically a catalog of screenshots, focused more on displaying some of its features than on explaining how to work with it in an organized manner.

Installation and configuration

Besides a working Python installation, the project does not list any external dependencies. There are, however, extra packages available (their complete list can be found here), and these may require some additional software to work.

Airflow installation is pretty straightforward, as it can be installed via PyPI using pip. However, the initially recommended installation procedure

pip install airflow

did not allow us to start the scheduler. We had to install Hive operators with

pip install "airflow[hive]"

to make it work.

Hands-on experience

The learning curve does not seem to be particularly steep, and simple workflows can be set up with minimal overhead. However, out of the box, Airflow uses a very simplistic, sequential task scheduler. Using more advanced ones (which allow the number of workers to be scaled up) requires separately setting up a real database backend (e.g., MySQL or Postgres). If workers are to be executed on remote sites, a message broker (e.g., RabbitMQ or Redis) and a Celery queue must be set up on top of that.

Summary

Airflow has two strong points: dynamic workflow generation (“configuration as code”) and a feature-rich web user interface. Additionally, due to its modular design it should be, at least in theory, easily adaptable to one's needs. However, having been written for ETL (Extract, Transform, Load) applications, it is hard to say how well it will fit into a more HPC-oriented environment. Despite these objections, it seems to be a viable candidate for LSST Batch Production Services, and its further testing is recommended.

Appendix B: CloudSlang

Overview

CloudSlang is a flow-based orchestration tool for managing deployed applications, written in Java and released under the Apache License, v2.0 (compatible with GPLv3). The project is composed of three main parts:

  • the CloudSlang Orchestration Engine (Score) written in Java,
  • the CloudSlang language: a YAML based DSL for writing human-readable workflows,
  • the ready-made CloudSlang content: a repository of common tasks and content that integrates with many modern technologies, such as Docker and CoreOS.

Its previous stable version, 0.9.50, was released in March 2016; the current stable version is 0.9.60, but we were unable to find its release date.

From the project's documentation:

There are two main types of CloudSlang content, operations and flows. An operation contains an action, which can be written in Python or Java. Operations perform the “work” part of the workflow. A flow contains steps, which stitch together the actions performed by operations, navigating and passing data from one to the other based on operation results and outputs. Flows perform the “flow” part of the workflow.

Features

Flexibility

Specifying workflows with the help of a YAML-like language offers a great deal of flexibility in their creation. Support for loops and conditional executions allows for their dynamic generation too.

Monitoring

An execution log file stores all events, and users can subscribe to events, but CloudSlang does not appear to provide monitoring tools or GUI to visualize the workflow execution.

Performance

Available sources do not provide any data regarding CloudSlang performance metrics.

Interoperability

Apparently, the entire model of the CloudSlang engine assumes that it is run on a single machine with a shared filesystem, so in-memory objects like variables can be passed between actions and flows; beyond that it does not support any form of data transfer. Additionally, actions and flows are executed exclusively by the engine, so their execution cannot be delegated to Condor, PBS, SLURM, etc.

Reliability

The recovery mechanism guarantees that each step of an execution plan will be run by monitoring the status of available workers: the queue records of non-responsive workers get reassigned to other workers, which pick up from the last known executed step. However, workflow restarts are not implemented.

Scalability

The project's documentation claims that CloudSlang workflows are optimized for high throughput and are horizontally scalable, but it does not provide any information about how big a workflow the system can really handle.

Support

The project's website lists 8 developers, all HP Enterprise software engineers.

It seems that the most reliable place to report issues or seek help is the project's GitHub site, as the dedicated Google group contains no posts at all.

The CloudSlang engine is embedded in HP Operations Orchestration (HP OO), which is used by approximately 50 companies from the Fortune 500 list.

User experience

Documentation

CloudSlang comes with well-written documentation which includes an extensive tutorial. The tutorial provides nice guidance on how to create a simple action and a flow, and gently introduces the more advanced aspects of the YAML-based language, allowing arbitrarily complex workflows to be created.

Installation

CloudSlang is written in Java and depends on Java 7 or greater. Additionally, to compile the project from its source Apache Maven >= 3.0.3 should be available.

Installation is simple: CloudSlang can be installed either by downloading a precompiled binary or by compiling from source. Docker images with CloudSlang are also available.

Hands-on experience

Due to CloudSlang limitations (see Summary for details) our testing did not go beyond examples from tutorial. Though, it seems, even complex workflows can be created with minimal overhead.

Summary

Despite its nice documentation, CloudSlang is definitely not a viable candidate for an LSST workflow management system. First and foremost, CloudSlang uses the Jython implementation of Python 2.7 to execute Python actions, and Jython is not a drop-in replacement for standard CPython. Secondly, all the available use cases are focused mainly on automating devops tasks and managing deployed applications, e.g., deploying an app that uses linked containers, cleaning unused Docker images, or checking the health status of an OpenStack deployment. Thus, we do not recommend CloudSlang for LSST Batch Production Services.

Appendix C: Makeflow

Overview

Makeflow is a portable workflow manager that allows users to express tasks in the style of a Makefile. The project was started in 2009 at the Cooperative Computing Lab, University of Notre Dame, by Prof. Douglas Thain, and it is released under the GNU GPL v2 (which is not compatible with GNU GPL v3). The last release (5.4.13) was on May 5, 2016.

Makeflow is bundled with other software as the Cooperative Computing Tools (cctools). Besides Makeflow, cctools also contains a lightweight distributed execution engine, Work Queue; a few specialized execution engines; a personal user-level virtual filesystem, Parrot; and a user-level distributed filesystem, Chirp.

Features

Flexibility

The syntax of Makeflow's workflow specification is similar to that of the traditional UNIX Make program, but more restrictive: only plain rules are allowed, with no pattern rules, wildcards, or special variables. A Makeflow script consists of a set of rules, each specifying a set of target files to create, a set of source files needed to create them, and a command to run. Makeflow relies on the file (or directory) dependencies explicitly stated in the rules. There seems to be no support for dynamic workflows.
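Since rules cannot be generated from within Makeflow itself, a large workflow would typically be produced by an external generator script. A minimal sketch in Python (the file names and the calibrate program are hypothetical):

# Emit one Makeflow rule per input file.
inputs = ["visit1.fits", "visit2.fits", "visit3.fits"]

with open("pipeline.makeflow", "w") as mf:
    for src in inputs:
        target = src.replace(".fits", ".calib.fits")
        # Makeflow rule syntax: "targets : sources", then a tab-indented command.
        mf.write("%s: %s\n\tcalibrate %s %s\n\n" % (target, src, src, target))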

Monitoring

Makeflow does not offer any monitoring tools besides logs. However, when used with Work Queue, a dashboard for monitoring the tasks currently being executed is available.

Performance

Some benchmark performance tests were done in Albrecht et al. (2012), with several workflow patterns with small and large files, in multiple environments including Work Queue, Condor, Hadoop, SGE, and local. For most tests, Makeflow used with Work Queue achieved better performance than other environments.

Interoperability

Makeflow can be deployed to a variety of different systems without modification, including local execution on a single multicore machine as well as batch systems like HTCondor, Work Queue, Sun Grid Engine (SGE) or its derivatives, Torque, Slurm, Moab, and Chirp, or custom-defined batch submission commands. For some batch systems, a distributed filesystem shared across all nodes of the cluster is assumed. Makeflow handles data transfer and caching on supported systems when used with Work Queue.

Reliability

By maintaining a transaction log, Makeflow can resume a workflow and continue from where it left off.

Scalability

Makeflow can handle \(O(10^{6})\) jobs, with \(O(1000)\) jobs running concurrently, according to their 2012 paper. By using Makeflow with Work Queue, one can distribute the load across a collection of machines.

Support

The software is maintained by one research engineer, with additional contributions from a small group of current graduate students.

Bugs are reported and tracked on the project's GitHub page. Most issues are reported by group members. New developments seem to depend on student projects.

The applications of Makeflow and cctools are mostly in bioinformatics, but they are also used in a number of smaller projects in biodiversity, data mining, high-energy physics, and molecular dynamics.

Over the years, the lab has been supported through multiple NSF grants, NIH grants, and other funds through Notre Dame.

User experience

Documentation

Makeflow comes with a user manual and a tutorial that introduce the Makeflow language and basic usage of the tools. This documentation makes starting easy; however, documentation of more advanced features seems scarce.

Installation

Installation of cctools is straightforward, using make. There is no need to download any external packages.

Hands-on experience

Overall we found Makeflow lightweight and easy to start with. It is easy to create a simple workflow with everything hard-coded; one uses executables and files as in shell scripts. However, Makeflow does not support creating workflows programmatically.

Summary

Though Makeflow is easy to start with, workflows cannot be generated programmatically, so generating large, complex workflows would require writing custom scripts. Additionally, many of its features are only available in cooperation with Work Queue, which may limit the execution environments. On top of that, its long-term support and development are questionable. Based on those facts, we do not recommend using Makeflow for LSST Batch Production Services.

Comments

For constructing large or complex workflows, Makeflow's developers started a tool called Weaver a few years ago. Weaver allows users to generate workflows in Python for use with Makeflow. However, it is still in the development phase, is not officially supported, and shows no recent development.

Appendix D: PanDA

Overview

PanDA is a highly scalable and flexible data-driven workload management system designed to meet the demanding computing requirements of the ATLAS experiment at the Large Hadron Collider (LHC) at CERN. PanDA has served ATLAS since 2008 and was released under Apache License, v2.0. While it was originally designed specifically for ATLAS, the core design of PanDA is not experiment specific.

Features

Flexibility

It should be emphasized that PanDA is foremost a workload management system, concerned with the large-scale dispatching of jobs to a possibly diverse collection of computing sites. PanDA does support workflows through the combination of JEDI and DEFT, though we note that these are relatively recent additions. They use the GraphML format to represent the DAG corresponding to a given workflow.
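We did not exercise the JEDI/DEFT interfaces directly, but for illustration, a workflow DAG can be written out in GraphML with generic tools such as networkx (this is plain networkx code, not a PanDA-specific API; the task names are invented):

import networkx as nx

# A three-step workflow expressed as a directed acyclic graph.
dag = nx.DiGraph()
dag.add_edge("ingest", "calibrate")   # calibrate consumes ingest's output
dag.add_edge("calibrate", "coadd")

# Serialize the DAG to GraphML, the format used by JEDI/DEFT.
nx.write_graphml(dag, "workflow.graphml")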

Monitoring

PanDA comes with a web based monitoring and browsing system that provides an interface to PanDA for users and operators.

Performance

PanDA has proved itself a scalable and reliable system capable of handling very large workloads. In 2014, PanDA processed about a million jobs per day, with about 150,000 jobs running at any given time.

Interoperability

The BigPanDA project was established in 2012 as an extension of the PanDA project. BigPanDA strives to generalize PanDA for use by other experiments, and to extend its scope to computing clouds and HPC platforms.

Reliability

Scalability

PanDA is capable of handling very large workflows.

Support

The PanDA analysis user community numbers over 1400.

User experience

Documentation

PanDA comes with exhaustive documentation describing the architecture of all its numerous components. However, it does not provide a tutorial or a user guide. Thus, taking into account the project's scale, there is significant overhead for newcomers just to get started with it.

Installation

At this point, we have not installed PanDA locally, though taking into account that it is a complex, multi-component system, its installation process may not be trivial.

Hands-on experience

See above.

Summary

PanDA's primary strength is as a workload management system. The support for workflows appears solid enough, but it is provided by more recent additions within non-core components. PanDA has been highly successful for ATLAS experiment computing on worldwide grids that include the European grids, the Open Science Grid, etc. However, it is a sizable system with numerous components, and it will require significant startup costs and investment by LSST DM. Nevertheless, we recommend PanDA for further consideration regarding LSST Batch Operation Services.

Appendix E: Pegasus

Overview

The Pegasus workflow management system maps abstract workflow descriptions (DAXes) onto concrete Directed Acyclic Graph (DAG) representations bound to resources, coordinates and automates data movement, and submits the jobs for execution in various environments. The package, distributed under the Apache License, v2.0, is written in a mix of Python, shell, Perl, and Java. The project started in 2001 as part of the GriPhyN project (Grid Physics Network), with collaboration with LIGO from the beginning. The last release (4.6.1) was on Apr 22, 2016 (4.6.0 was on Jan 27, 2016).

Features

Flexibility

Pegasus provides a good deal of flexibility in the generation of workflows. They can be created programmatically with the help of the provided Java, Perl, and Python APIs, which make it easy to write DAX files in XML format. Users can embed a sub-workflow inside a workflow by specifying either a DAX job or a DAG job in the DAX.
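For illustration, a fragment in the style of the Python DAX API examples (based on the DAX3 API current at the time of writing; the file and transformation names are invented):

import sys
from Pegasus.DAX3 import ADAG, File, Job, Link

dax = ADAG("toy-drp")

raw = File("raw.fits")
calexp = File("calexp.fits")

calibrate = Job(name="calibrate")         # hypothetical transformation
calibrate.addArguments("raw.fits", "calexp.fits")
calibrate.uses(raw, link=Link.INPUT)
calibrate.uses(calexp, link=Link.OUTPUT)
dax.addJob(calibrate)

coadd = Job(name="coadd")                 # hypothetical transformation
coadd.uses(calexp, link=Link.INPUT)
dax.addJob(coadd)

# Job dependencies are declared explicitly, besides the file-level links.
dax.depends(parent=calibrate, child=coadd)

dax.writeXML(sys.stdout)                  # emit the abstract workflow as XML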

Interoperability

Pegasus execution environments can be Condor pools, Grid infrastructures such as Open Science Grid and XSEDE, Amazon EC2, Google Cloud, and many campus HPC clusters.

Monitoring

Pegasus has a runtime monitoring daemon that watches the running workflow, parses the workflow, job, and task logs, and populates a workflow database. The database stores both performance and provenance information. Pegasus provides tools for monitoring status, debugging and diagnosing failures, and showing summaries and statistics, with a web dashboard and plots.

Performance

The Pegasus mapper can reorder, group, and prioritize tasks in order to increase overall workflow performance. Most optimization is done through compile-time workflow restructuring. Runtime optimization relies on hierarchical workflows: sub-workflows are mapped and planned just in time, so mapping and execution interleave, allowing some dynamic workflows. Full dynamic planning features may be added in the future.

Reliability

Jobs and data transfers are automatically retried in case of failures. Debugging tools such as pegasus-analyzer help the user to debug the workflow in case of non-recoverable failures.

When errors occur, Pegasus tries to recover when possible by retrying tasks, by retrying the entire workflow, by providing workflow-level checkpointing, by re-mapping portions of the workflow, by trying alternative data sources for staging data, and, when all else fails, by providing a rescue workflow containing a description of only the work that remains to be done.

Scalability

Pegasus can scale both in the size of the workflow and in the resources that the workflow is distributed over. Pegasus runs workflows ranging from just a few computational tasks up to 1 million (using sub-DAGs). The number of resources involved in executing a workflow can scale as needed without any impediments to performance.

Support

There are currently 5 programmers involved in Pegasus development.

There is a JIRA ticket system and a user mailing list for reporting bugs and issues and for general discussion. Traffic on the list is typically a few threads per month. Important bugs seem to get resolved within a few months or sooner. For large projects, the developers may work directly and closely with external collaborators, with meetings and additional customized support.

More than 10 large projects have reported using Pegasus, covering a wide variety of domains including astrophysics, seismology, bioinformatics, and others. Notable astrophysics projects include:

  • LIGO,
  • IPAC Montage: Assembling astronomical mosaic images, with FITS files as input,
  • Kepler light curve analysis.

The recent seismology project CyberShake has particularly high job counts: the number of tasks in one workflow approaches a million, and the total counts of tasks and files are O(million). Other notable use cases are Accelerated Climate Modeling for Energy (ACME) and the Spallation Neutron Source (SNS), both DOE applications, in the Pegasus-based Panorama project. Example DAXes for various projects can be found here.

The project is funded by the NSF.

User experience

Documentation

The beginner tutorial is nice, but documentation on features beyond the tutorial level varies in quality, and some of it seems difficult to find. Nonetheless, one can learn from the available examples.

Installation

Pegasus lists Condor and OpenSSL as its dependencies. Binary distributions are available, or it can be built from source using Ant.

Building it from source using Ant is straightforward. Though we failed to build it on Mac due to an OpenSSL-related error (which we did not attempt to solve), compiling it on CentOS went fine.

Hands-on experience

For trivial jobs, the overhead seems high. We made a much-simplified mock-up of the first stage of LSST DRP processing. Pegasus' own tools cannot visualize a workflow unless it has been executed successfully; this makes debugging DAX construction difficult for beginners. Besides linking the inputs/outputs, job dependencies need to be specified explicitly.

Summary

Pegasus comes with decent documentation, offers a good deal of flexibility in workflow generation, and integrates well with a variety of execution environments. Moreover, its workflows are portable: once a user's workflow can be executed locally, no changes to the code or DAX generators are required for remote execution, and the same workflow can run across a heterogeneous set of resources. Its performance and scalability have already been tested in many large scientific projects. Thus, we recommend Pegasus for further testing regarding its application to LSST Batch Production Services.

Comments

Execution environment

The Pegasus planner needs 4 components to map out the plan and make a concrete workflow over the resources:

  • DAX: the abstract workflow description in XML, generated by Pegasus API
  • Site catalog: describes the execution environment
  • Replica catalog (files): specifies locations of input data
  • Transformation catalog (executables): specifies locations of software used by the workflow

Pegasus relies on Condor DAGMan as the workflow execution engine.

MPI support

There is a tool, pegasus-mpi-cluster (PMC), to run DAGs on some HPC systems. PMC is designed to work either as a standalone tool or as a complement to Pegasus. In the mode in which the entire workflow is managed by PMC (PMC-only mode), DAGMan and Condor are not required. Alternatively, PMC can be used as the wrapper for executing clustered jobs in Pegasus; in this mode Pegasus groups several tasks together and submits them as a single clustered job to a remote system.

Appendix F: Pinball

Overview

Pinball is a workflow management platform developed at Pinterest, written in Python, and distributed under the Apache License, v2.0. From the project's documentation:

[Pinball] is built based on layered approach. The base layer provides a generic master-worker model. Master maintains the system state represented by a collection of abstract tokens. Tokens are claimed and modified by workers implementing application specific logic. The application layer introduces tokens specific to workflows, workers implementing the job execution logic, a parser converting workflow specification to a collection of tokens, a scheduler allowing us to prepare workflow execution ahead of time, and a UI for manual control and monitoring.

Features

Sadly, the available online resources do not provide any information regarding Pinball's flexibility, performance, interoperability, reliability, or scalability. Supposedly there is a web-based UI to monitor workflow execution, but there is no documentation regarding it.

Support

Pinball is being developed by 3 programmers associated with Pinterest.

There is a Google group mentioned in the official documentation for sending questions and comments, but it seems to have been abandoned by the developers. It is mostly filled with spam, and the few posts with actual questions about Pinball were not addressed at all.

We did not find any third parties using Pinball internally as their workflow management system. The project does not seem to be in active development anymore; its GitHub repository shows only a few commits during the last year.

Though Pinterest was valued at $11 billion in June 2015, it is hard to tell how well the Pinball project is funded and what the company's long-term plans for it are.

User experience

Documentation

Besides the project's README file, Pinball's documentation is practically non-existent.

Installation

Pinball requires:

  • Graphviz: a graph visualization software,
  • libmysqlclient: MySQL client libraries.

Pinball uses a MySQL database as persistent storage, so access to a running MySQL instance is also required.

Installation is very easy; the Pinball package is available via PyPI and can be installed with pip.

Hands-on experience

Taking into account the overall impression (lack of decent documentation and support), we did not even try to set up an example workflow with Pinball.

Summary

Pinball looks like a rather immature project with an uncertain future. It has practically no documentation, does not seem to be actively developed, and support from the developers looks rather poor. Others seem to have come to similar conclusions, though to be fair, at least the installation issue was fixed (see the link above). As a result, it is not a viable candidate for use with LSST Batch Operation Services.

Appendix G: RADICAL-Pilot

Overview

RADICAL-Pilot (RP) is a pilot job system (a library) written in Python and released under the MIT License. It allows a user to run large numbers of computational tasks concurrently on a multitude of different distributed resources, like HPC clusters and clouds. RP is being developed by RADICAL, a group from Rutgers University. Its latest stable version, 0.40.3, was released in May 2016.

Features

Flexibility

As RP is a library, its workflow specifications are Python scripts. Thus RP can be used to execute (theoretically) arbitrarily complex workflows and generate them dynamically. Jobs can be single or multi-threaded; MPI applications are also supported.
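For illustration, a sketch in the style of the RP tutorial examples of that era (API details differ between versions; it assumes the RADICAL_PILOT_DBURL environment variable points at a reachable MongoDB instance):

import radical.pilot as rp

session = rp.Session()
try:
    # Acquire a pilot on a target resource (here, the local machine).
    pmgr = rp.PilotManager(session=session)
    pdesc = rp.ComputePilotDescription()
    pdesc.resource = "local.localhost"
    pdesc.cores = 2
    pdesc.runtime = 10  # minutes
    pilot = pmgr.submit_pilots(pdesc)

    # Describe a compute unit and run it on the pilot.
    umgr = rp.UnitManager(session=session, scheduler=rp.SCHED_DIRECT_SUBMISSION)
    umgr.add_pilots(pilot)

    cud = rp.ComputeUnitDescription()
    cud.executable = "/bin/echo"
    cud.arguments = ["hello from a pilot"]
    umgr.submit_units(cud)
    umgr.wait_units()
finally:
    session.close()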

Monitoring

Each compute unit has metadata associated with it which can be inspected and used to monitor the unit's execution. However, RP does not provide any monitoring tools besides logging.

Performance

We were not able to find any metrics regarding the rate of dispatching jobs or sustainable CPU-utilization levels.

Interoperability

RP is built on top of the Simple API for Grid Applications (SAGA, also developed by RADICAL), a high-level API for accessing distributed resources. This means that RP should work on a variety of backends such as PBS, Condor and Condor-G, SGE, SLURM, LSF, or Amazon EC2. Data transfer is supported via local copying, SFTP, and GSISFTP protocols.

Reliability

Each compute unit progresses through a certain state model, and failed units can easily be detected. However, the RP documentation does not provide examples of any recovery mechanisms beyond simple retries. As far as we can tell, workflow restarts or task replications are not possible.

Scalability

According to this source, RP can marshal O(10k) distributed cores and schedule O(10k) tasks in a single application context across multiple distributed HPC resources.

Support

The group developing RP consists of 4 researchers and developers and 10 students.

There are two Google groups for dealing with project’s issues:

  1. radical-pilot-users (public) to report problems and get assistance from the team and community,
  2. radical-pilot-devel (requires membership) for contributors and people interested in discussing implementation issues.

The traffic on the public forum is very light (42 threads over 2 years), with one or two threads per month on average. All reported issues were addressed by an RP developer rather quickly.

It seems that the community is very small, i.e., it has few users. On top of that, not a single showcase is mentioned on the project's website, and we were unable to find any external project claiming to use RP internally.

RP is a part of RADICAL Cybertools, an abstraction-based suite supporting science on a range of high-performance and distributed computing systems. The whole project is supported by two NSF grants until 2018, with a total budget of approximately $1.5M.

User experience

Documentation

The documentation includes a tutorial, with ready-to-run example scripts, introducing various aspects of working with RP, e.g., setting up a session, monitoring jobs, and staging input and output data. However, one has to at least skim the user guide to become familiar with core RP concepts and understand what is going on. Also, a rather annoying thing is that the documentation is not entirely in sync with the code base: we encountered a few cases where following the documentation verbatim led to execution errors, and applying the “Use the Force, Read the Source” principle was necessary.

Installation

RP requires the following packages:

  • Python >= 2.7,

  • Virtualenv >= 1.11,

  • Pip == 1.4.1

    Warning

    They really mean it! Newer versions of pip fail to install RP properly without emitting any errors.

Using RP on remote machines also requires setting up passwordless login to the particular machine. Besides that, RP needs access to a MongoDB server, which is used to store and retrieve operational data during the execution of an application. The MongoDB server must be reachable from both the host that runs the RP application and the target resource which runs the pilots.

Installation is very easy; the RP package is available via PyPI and can be installed with pip. The MongoDB location is communicated to RP via an environment variable.

Hands-on experience

Overall, writing an example workflow and executing it on a local host is relatively simple.

Summary

RADICAL-Pilot comes with relatively decent documentation, offers a great deal of flexibility in writing workflows, and should integrate well with a typical HPC environment. Despite those advantages, it comes with two serious limitations. Firstly, it only supports data at the file abstraction level (data == files at this moment), which may not integrate well with the data butler used by the LSST stack. Secondly, though the developers promise to fix this in the near future, RP is not yet able to function in an environment using an Anaconda-based Python distribution. Thus it has to be ruled out as a viable candidate for LSST Batch Production Services.

Appendix H: Swift

Overview

Swift is a simple, implicitly parallel functional language with C-like syntax, written in Java, and released under the Apache License, v2.0. Its latest stable version, 0.96.2, was released in August 2015. From its white paper:

Swift is a scripting language designed for composing application programs into parallel applications that can be executed on multicore processors, clusters, grids, clouds, and supercomputers. Unlike most other scripting languages, Swift focuses on the issues that arise from the concurrent execution, composition, and coordination of many independent (and, typically, distributed) computational tasks. Swift scripts express the execution of programs that consume and produce file-resident datasets. Swift uses a C-like syntax consisting of function definitions and expressions, with dataflow-driven semantics and implicit parallelism.

Features

Flexibility

As a scripting language Swift offers a great deal of flexibility in designing workflows. Support for iterations and conditional executions allows for their dynamic generation.

Monitoring

Swift offers no monitoring tools besides logs.

Performance

According to its white paper, Swift is able to dispatch thousands (approx. 3000) of jobs per second sustaining high CPU utilization (80–95%).

Portability

Remote execution and data transfer are provided by abstract interfaces called providers; thus the Swift execution model can be extended to new computational environments by implementing new data and/or execution providers. Currently implemented data providers support protocols including direct local copying, GridFTP, HTTP, WebDAV, SCP, and FTP. Execution providers enable job execution by using POSIX fork, Globus GRAM, Condor and Condor-G, PBS, SGE, SLURM, LSF, and SSH services. Swift workflows can be used on many remote computational resources including Beagle (UChicago), Blues (LCRC), Edison (NERSC), EC2 (Amazon), Open Science Grid, Stampede (XSEDE/TACC), and Swan (Cray MPN). A complete list of supported sites can be found in this location.

Reliability

There are three reliability mechanisms implemented by the Swift runtime environment: retries, restarts, and replications. This means that Swift will automatically rerun a failing job, completely reattempting site selection, stage-in, execution, and stage-out. It also keeps a restart log that records which function invocations have been successfully completed and may be skipped in subsequent Swift runs. If a job has been queued on a site for too long (based on a configurable threshold), a replica job is submitted; if any of those jobs begins executing, the other replicas are cancelled.

Scalability

Swift scales easily in terms of both the size of the workflow and the resources. It can effectively run a few tasks on a local machine as well as hundreds of thousands of jobs on thousand-core clusters.

Support

There are two mailing lists for dealing with the project's issues:

  1. swift-user to report problems and receive assistance from the team and community,
  2. swift-devel for contributors and people interested in discussing implementation issues.

However, in the latest project report, from October 5, 2015, the authors admitted that many Swift users still communicate with the group through direct private conversations.

Note

Due to a migration of services at the UC Computation Institute, the lists were offline at the time we were writing this review and may still remain inaccessible.

In the same report they stated that at least 20 scientific teams are using Swift in designing and executing their workflows. Those teams span a very broad spectrum of scientific activities, including the physical, biological, social, and computer sciences. Based on unique IPs accessing the project's site, the estimated user base is approximately 2400.

According to Dan Katz, at the moment of writing this review the project is not funded; however, an NSF grant proposal for its further development has just been recommended for funding.

User experience

Documentation

Swift comes with really extensive documentation and numerous, more in-depth scientific papers presenting its design and various performance metrics. Its tutorial, though, is rather limited: it does not provide much introductory information but shows examples of Swift scripts for various basic workflows. Granted, with the C-like syntax, getting the general idea is rather easy, but some Swift-specific constructs are presented without any explanation. Ironically, we found the user guide, which explains Swift's syntax and data types, much more approachable.

Installation

Swift is written in Java and depends on Java 1.7 or greater. Though not strictly required, the authors recommend Oracle Java.

Swift/T (see the Comments section for more details) additionally requires:

  • an MPI implementation (MPICH, OpenMPI, etc.),
  • Tcl 8.6,
  • Software development tools: gcc, make, SWIG,
  • optionally, ZSH for the build script.

Installation is very simple. Swift can be installed either by downloading a precompiled binary distribution or by compiling its source, though for the latter Apache Ant has to be installed in addition to the JDK.

Currently there is no binary distribution for Swift/T; it has to be compiled from source.

Hands-on experience

There is significant overhead in running even simple custom workflows, as a user must first learn the Swift syntax and understand its language model. But once those aspects are more or less understood, executing a workflow on distributed computing resources is rather simple.

Summary

Swift seems to be a feature-rich, thoughtfully designed piece of software with well-written, extensive documentation, and it has been quite thoroughly tested in scientific settings. Its major drawback is the rather steep learning curve associated with getting acquainted with its language model. There is also the minor issue of it lacking any high-level monitoring tools. However, it still seems to be a viable candidate for a workflow management system to be used in LSST Batch Operation Services.

Comments

This review focuses mainly on the Swift/K version, which runs on the Karajan grid workflow engine and is designed to execute a workflow of program executions across wide-area resources, exploiting diverse schedulers (PBS, Condor, etc.) and data transfer technologies. There also exists a newer implementation of the Swift language for high-performance computing, called Swift/T. In this implementation, the Swift script is translated into an MPI program that uses the Turbine and ADLB runtime libraries for highly scalable dataflow processing over MPI on a single large system. Compared to Swift/K, it offers:

  • Enhanced performance: 1.5 billion tasks/s;
  • Ability to call native code functions (C, C++, Fortran);
  • Ability to execute scripts in embedded interpreters (Python, R, Tcl, Julia, etc.);
  • Enhanced built-in libraries (string, math, system, etc.).

Its latest stable version, 0.8.0, was released in April 2015.

According to Daniel Katz, once the ability to execute tasks at different computational sites is implemented in Swift/T, the Swift/K version will be deprecated. At the same time, Swift scripts written for Swift/K should require little to no modification to run with Swift/T.

Swift/K uses Globus as middleware to talk to various resources.

References

[CONOPS] Concept of Operations for the LSST Batch Production Services (http://ls.st/cr4).
[XSEDE] Workflow Systems on XSEDE, tutorial presented at XSEDE15 and XSEDE16 (https://sites.google.com/site/xsedeworkflows/).