HomePhabricator
Runnable runbooks
A little side project

Recently there has been a small effort on the Release-Engineering-Team to encode some of our institutional knowledge as runbooks linked from a page in the team's wiki space.

What are runbooks, you might ask? This is how they are described on the aforementioned wiki page:

This is a list of runbooks for the Wikimedia Release Engineering Team, covering step-by-step lists of what to do when things need doing, especially when things go wrong.

So runbooks are each essentially a sequence of commands, intended to be pasted into a shell by a human. Step by step instructions that are intended to help the reader accomplish an anticipated task or resolve a previously-encountered issue.

Presumably runbooks are created when someone encounters an issue, and, recognizing that it might happen again, helpfully documents the steps that were used to resolve said issue.

This all seems pretty sensible at first glance. This type of documentation can be really valuable when you're in an unexpected situation or trying to accomplish a task that you've never attempted before and just about anyone reading this probably has some experience running shell commands pasted from some online tutorials, setup instructions for a program, etc.

Despite the obvious value runbooks can provide, I've come to harbor a fairly strong aversion to the idea of encoding what are essentially shell scripts as individual commands on a wiki page. As someone who's job involves a lot of automation, I would usually much prefer a shell script, a python program, or even a "maintenance script" over a runbook.

After a lot of contemplation, I've identified a few reasons that I don't like runbooks on wiki pages:

  • Runbooks are tedious and prone to human errors.
    • It's easy to lose track of where you are in the process.
    • It's easy to accidentally skip a step.
    • It's easy to make typos.
  • A script can be code reviewed and version controlled in git.
  • A script can validate it's arguments which helps to catch typos.
  • I think that command line terminal input is more like code than it is prose. I am more comfortable editing code in my usual text editor as apposed to editing in a web browser. The wikitext editor is sufficient for basic text editing, and visual editor is quite nice for rich text editing, but neither is ideal for editing code.

I do realize that mediawiki does version control. I also realize that sometimes you just can't be bothered to write and debug a robust shell script to address some rare circumstances. The cost is high and it's uncertain whether the script will be worth such an effort. In those situations a runbook might be the perfect way to contribute to collective knowledge without investing a lot of time into perfecting a script.

My favorite web comic, xkcd, has a lot few things to say about this subject:

the_general_problem.png (230×550 px, 23 KB)
"The General Problem" xkcd #974.

automation_2x.png (817×807 px, 55 KB)
"Automation" xkcd #1319.

is_it_worth_the_time_2x.png (927×1 px, 125 KB)
"Is It Worth the Time?" xkcd #1205.

Potential Solutions

I've been pondering a solution to these issues for a long time. Mostly motivated by the pain I have experienced (and the mistakes I've made) while executing the biggest runbook of all on a regular basis.

Over the past couple of years I've come across some promising ideas which I think can help the problems I've identified with runbooks. I think that one of the most interesting is Do-nothing scripting. Dan Slimmon identifies some of the same problems that I've detailed here. He uses the term *slog* to refer to long and tedious procedures like the Wikimedia Train Deploys. The proposed solution comes in the form of a do-nothing script. You should go read that article, it's not very long. Here are a few relevant quotes:

Almost any slog can be turned into a do-nothing script. A do-nothing script is a script that encodes the instructions of a slog, encapsulating each step in a function.

...

At first glance, it might not be obvious that this script provides value. Maybe it looks like all we’ve done is make the instructions harder to read. But the value of a do-nothing script is immense:

  • It’s now much less likely that you’ll lose your place and skip a step. This makes it easier to maintain focus and power through the slog.
  • Each step of the procedure is now encapsulated in a function, which makes it possible to replace the text in any given step with code that performs the action automatically.
  • Over time, you’ll develop a library of useful steps, which will make future automation tasks more efficient.

A do-nothing script doesn’t save your team any manual effort. It lowers the activation energy for automating tasks, which allows the team to eliminate toil over time.

I was inspired by this and I think it's a fairly clever solution to the problems identified. What if we combined the best aspects of gradual automation with the best aspects of a wiki-based runbook? Others were inspired by this as well, resulting in tools like braintree/runbook, codedown and the one I'm most interested in, rundoc.

Runnable Runbooks

My ideal tool would combine code and instructions in a free-form "literate programming" style. By following some simple conventions in our runbooks we can use a tool to parse and execute the embedded code blocks in a controlled manner. With a little bit of tooling we can gain many benefits:

  • The tooling will keep track of the steps to execute, ensuring that no steps are missed.
  • Ensure that errors aren't missed by carefully checking / logging the result of each step.
  • We could also provide a mechanism for inputting the values of any variables / arguments and validate the format of user input.
  • With flexible control flow management we can even allow resuming from anywhere in the middle of a runbook after an aborted run.
  • Manual steps can just consist of a block of prose that gets displayed to the operator. With embedded markup we can format the instructions nicely and render them in the terminal using [Rich][7]. Once the operator confirms that the step is complete then the workflow moves on to the next step.

Prior Art

I've found a few projects that already implement many of these ideas. Here are a few of the most relevant:

The one I'm most interested in is Rundoc. It's almost exactly the tool that I would have created. In fact, I started writing code before discovering rundoc but once I realized how closely this matched my ideal solution, I decided to abandon my effort. Instead I will add a couple of missing features to Rundoc in order to get everything that I want and hopefully I can contribute my enhancements back upstream for the benefit of others.

Demo: https://asciinema.org/a/MKyiFbsGzzizqsGgpI4Jkvxmx
Source: https://github.com/20after4/rundoc

References

[1]: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Runbooks "runbooks"
[2]: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys "Train deploys"
[3]: https://blog.danslimmon.com/2019/07/15/do-nothing-scripting-the-key-to-gradual-automation/ "Do-nothing scripting: the key to gradual automation by Dan Slimmon"
[4]: https://github.com/braintree/runbook "runbook by braintree"
[5]: https://github.com/earldouglas/codedown "codedown by earldouglas"
[6]: https://github.com/eclecticiq/rundoc "rundoc by eclecticiq"
[7]: https://rich.readthedocs.io/en/latest/ "Rich python library"

Written by mmodell on Dec 11 2020, 11:51 PM.
Volunteer Phabricator Admin

Event Timeline

This comment was removed by 20after4.