Process Automation

Yamop

One of my most fundamental contributions to the Chung Lab was the near-complete overhaul of our neural imaging computing pipeline, an effort I dubbed “Yamop” (Yet Another Microscopy Operations Pipeline) in honor of our similarly named ops platform at Panalgo: “Yaeop” (Yet Another ETL Operations Pipeline).

Originally, nearly all computing tasks were performed by individual users running programs by hand through the command line. This was highly inefficient, though from my understanding it is not particularly uncommon outside of companies specifically devoted to software development and computer science labs. The precise shape of the problem plummets into minutiae very quickly, so rather than provide paragraphs of prose, I think things can be summarized like this: before I arrived, the lab’s pipelines lacked task scheduling, resource allocation, logging, monitoring, and quality assurance. Yamop, as a platform of wrapper scripts, provided our pipelines with task scheduling, resource allocation, logging, monitoring, and quality assurance. There you have it.

Components

Yamop was primarily written in Bash and Python, with a small amount of JavaScript; the Bash was mostly used to interface with Unix utilities like cron and rsync.

Basic Framework

The main framework of Yamop is split across three GitHub repositories: yamop, which contains the code itself; yamop_configs, which contains the .json config files for pipelines to be run through the system; and yamop_logs, which stores the logs generated by any task run through the platform.

A user invokes the run_task.py script, followed by the name of a Bash or Python script and whatever arguments that script requires, and Yamop runs the given script as normal with the benefit of its suite of features.
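
To give a feel for it, the core of run_task.py looks something like the sketch below. This is a minimal illustration rather than the lab’s actual code; the function names and the log-directory layout are my own placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch of a run_task.py-style wrapper (illustrative, not the real lab code)."""
import subprocess
import sys
import time
from pathlib import Path

LOG_DIR = Path("~/yamop_logs").expanduser()  # assumed location of the local logs clone

def run_task(cmd):
    """Run a Bash/Python script with its arguments, capturing all output to a timestamped log."""
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    log_path = LOG_DIR / f"{Path(cmd[0]).stem}_{stamp}.log"
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode

if __name__ == "__main__":
    # e.g.  python run_task.py destripe.py --sample sample01
    sys.exit(run_task(sys.argv[1:]))
```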

The aforementioned config .json files are run through the run_config.py script, which essentially chains together run_task.py commands.
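
A config file is essentially an ordered list of run_task.py invocations. The sketch below shows the general idea; the schema shown in the comment is my guess at the shape of a yamop_configs entry, not the actual format.

```python
import json
import subprocess

# Hypothetical config structure -- the real yamop_configs schema may differ:
# {
#   "pipeline": "lightsheet_processing",
#   "tasks": [
#     {"script": "destripe.py", "args": ["--input", "{sample_dir}"]},
#     {"script": "stitch.py",   "args": ["--input", "{sample_dir}"]}
#   ]
# }

def run_config(config_path, **params):
    """Chain run_task.py calls, one per task entry, filling in per-sample parameters."""
    with open(config_path) as f:
        config = json.load(f)
    for task in config["tasks"]:
        args = [a.format(**params) for a in task["args"]]
        cmd = ["python", "run_task.py", task["script"], *args]
        if subprocess.run(cmd).returncode != 0:
            raise RuntimeError(f"Task {task['script']} failed; stopping pipeline")
```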

Task Scheduling

Task scheduling was achieved through the run_config.py script: every step of a pipeline could be compiled into a generic config file, with the inputs for any particular sample supplied by the operator at launch.
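
In practice that meant an operator could kick off an entire pipeline for a new sample with a single command. Using the run_config sketch above, with the config name, flag, and path invented for illustration:

```python
# Hypothetical launch of a full pipeline for one sample; names and paths are placeholders.
# Command-line form:  python run_config.py lightsheet_processing.json --sample_dir /data/sample42
run_config("lightsheet_processing.json", sample_dir="/data/sample42")
```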

Furthermore, the Unix cron utility was used for necessary recurring tasks, like automatically pushing logs to the yamop_logs git repo, though the same functionality was never implemented for the main yamop or yamop_configs repos, for obvious reasons.
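
For the recurring log push, a small script along the following lines could be registered with cron. This is a sketch under assumed paths and schedule, not the actual job.

```python
#!/usr/bin/env python3
"""push_logs.py -- commit and push accumulated logs to the yamop_logs repo.
Illustrative sketch; run periodically via a crontab entry such as:
    0 * * * * /usr/bin/python3 /opt/yamop/push_logs.py
"""
import subprocess

LOGS_REPO = "/opt/yamop_logs"  # assumed path to the local clone of yamop_logs

def push_logs():
    subprocess.run(["git", "-C", LOGS_REPO, "add", "-A"], check=True)
    # Commit only if there is something new; 'git commit' exits nonzero when the tree is clean.
    commit = subprocess.run(["git", "-C", LOGS_REPO, "commit", "-m", "automated log push"])
    if commit.returncode == 0:
        subprocess.run(["git", "-C", LOGS_REPO, "push"], check=True)

if __name__ == "__main__":
    push_logs()
```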

Resource Allocation

Resource allocation was implemented through a mix of the Unix cgroups and mpstat utilities, with the former allowing a user to specify a desired priority for their program and the latter used to check overall CPU usage on the system. If the current computing server was already full, Yamop could run the given process on another, less busy server on the network through a rather convoluted implementation of Ansible.
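
The gist of the dispatch logic: check idle CPU with mpstat, then either start the job locally under a throttled cgroup or hand it off to a quieter machine. The sketch below glosses over the Ansible hand-off (shown here as a bare ssh call), and the threshold, cgroup name, and host list are placeholders.

```python
import subprocess

CPU_BUSY_THRESHOLD = 80.0                  # percent; placeholder value
OTHER_SERVERS = ["compute2", "compute3"]   # placeholder hostnames

def cpu_usage_percent():
    """Parse overall CPU usage from 'mpstat 1 1' (100 minus the %idle of the Average line)."""
    out = subprocess.run(["mpstat", "1", "1"], capture_output=True, text=True, check=True).stdout
    avg_line = [l for l in out.splitlines() if l.startswith("Average")][-1]
    idle = float(avg_line.split()[-1])      # %idle is the final column
    return 100.0 - idle

def dispatch(cmd, priority_group="yamop_low"):
    if cpu_usage_percent() < CPU_BUSY_THRESHOLD:
        # Run locally under a pre-created cgroup to cap its CPU share (libcgroup's cgexec).
        subprocess.run(["cgexec", "-g", f"cpu:{priority_group}", *cmd])
    else:
        # Hand the job to a less busy server; the real implementation used Ansible for this step.
        subprocess.run(["ssh", OTHER_SERVERS[0], " ".join(cmd)])
```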

Logging

As mentioned above, Yamop logs the output of scripts to a log file, which is then uploaded to the yamop_logs repo. Of all the modules, this one probably has the largest gap between how easy it was to implement and how much work it took to actually become useful: capturing the output of our various scripts was easy, but developing methods to sift through it so that the logs only include useful information took ages.
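
The filtering side was the time sink. A much-simplified version of the idea: scan the raw log, keep only lines worth a human’s attention, and write those to the condensed log that actually gets pushed. The patterns below are invented for illustration; the real filters were built up per pipeline over many iterations.

```python
import re
from pathlib import Path

# Invented patterns standing in for the per-pipeline filters.
INTERESTING = re.compile(r"(ERROR|WARNING|Traceback|completed in \d+)", re.IGNORECASE)

def condense_log(raw_log: Path, condensed_log: Path):
    """Copy only the lines a human would actually want to read into the pushed log."""
    with open(raw_log) as src, open(condensed_log, "w") as dst:
        for line in src:
            if INTERESTING.search(line):
                dst.write(line)
```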

Monitoring

The monitoring module is composed of two main elements: a web based monitoring board and a slack bot.

The monitoring board is a JavaScript web page served from the lab’s Apache 2 server. It displays the currently active jobs on each server, each server’s overall capacity, and the jobs expected to run in the future (those queued up from a config file).

The slack bot monitors the output of any process run through yamop, and sends messages to the user if it seems their task has stalled, errored out, or is otherwise taking longer than expected.
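
The notification half of the bot is the simple part; posting to a Slack incoming webhook is nearly a one-liner. A rough sketch of that piece is below, with the webhook URL and stall threshold as placeholders; the stall-detection heuristics were the real work.

```python
import json
import time
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
STALL_SECONDS = 2 * 60 * 60  # placeholder: two hours with no new output counts as stalled

def notify(user, message):
    """Post a message to the lab Slack via an incoming webhook."""
    payload = json.dumps({"text": f"<@{user}> {message}"}).encode()
    req = urllib.request.Request(SLACK_WEBHOOK_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

def check_stalled(user, task_name, last_output_time):
    if time.time() - last_output_time > STALL_SECONDS:
        notify(user, f"{task_name} has produced no output for over two hours -- it may have stalled.")
```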

Quality Assurance

The final aspect of Yamop was quality assurance, which was highly specific to each task. For the microscopy processing pipeline, this took the form of several scripts meant to better align each chunk of a given image and then align the sample slabs together; however, the framework allowed any number of tasks to be designated the “QA” step of a given config.

While nearly every other aspect of the main pipeline was automated, the work this portion demands remains stubbornly human.

Results

Yamop is the computing portion of the paper the Chung Lab recently published in Science. It’s why we’re able to analyze and share the sheer volume of neural data that the lab can image.

It also reduced the total man-hours per TB of processed data from roughly 1.7 hours/TB (about 5 hours to fully process and run QC on a 3 TB sample) to roughly 0.17 hours/TB (about half an hour of operator work for the same 3 TB sample), about one tenth of the previous time. Now the entire output of the lab’s neural imaging can be processed and verified by one trained staff member.

Many grad students were freed from toil by this advancement.