Instrumentation developed for the web is typically feature-flagged using one or more MediaWiki configuration variables. Each variable is assigned a distinguished default value, some form of 'false'. When the instrumentation initializes, it checks for this default value and, if it finds it, does not initialize. In this way, instrumentation that is merged and deployed to production is disabled by default on all projects.
MediaWiki configuration variables can be assigned values per project (e.g. English Wikipedia, Wiktionary, Hawaiian Wikipedia, Metawiki, ...). We use this to enable instrumentation selectively on a subset of projects, and also to roll out new instrumentation gradually.
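The gating described above can be sketched as follows. This is a language-neutral illustration in Python, not MediaWiki's actual configuration code: the flag name `wgExampleInstrumentationEnabled`, the helper functions, and the choice of overridden projects are all hypothetical.

```python
# Hypothetical flag name, used for illustration only.
FLAG = "wgExampleInstrumentationEnabled"

# Default value for every project: the instrumentation is off.
DEFAULTS = {FLAG: False}

# Per-project overrides (illustrative project identifiers).
OVERRIDES = {
    "hawwiki": {FLAG: True},
    "metawiki": {FLAG: True},
}


def get_config(project: str) -> dict:
    """Return the effective configuration for one project."""
    config = dict(DEFAULTS)
    config.update(OVERRIDES.get(project, {}))
    return config


def init_instrumentation(config: dict) -> bool:
    """Initialize only if the flag has been changed from its default."""
    if not config.get(FLAG, False):
        return False  # flag still at its 'false' default: do nothing
    # ... real set-up (event listeners, sampling, etc.) would go here ...
    return True
```

With these assumed overrides, `init_instrumentation(get_config("enwiki"))` returns `False` and `init_instrumentation(get_config("hawwiki"))` returns `True`, so merging and deploying the code changes nothing until a project is explicitly opted in.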
This method works fine in theory, but in practice it takes far longer to ramp a piece of instrumentation than it probably should. The following decisions are all handled in a mostly ad hoc way:
- Which projects to target
- In which sequence
- Whether to enable them in tranches or one-by-one
- How much time to leave between each step
- How to evaluate the results of each step
- When to pause
- When to roll back
As part of our goal to improve operational excellence in the area of instrumentation, we would like to develop and codify a ramp and a set of criteria that together give us a standard procedure for deploying into production. This should also help us get patches into production faster.
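One way to picture a codified ramp is as data plus a small driver loop. The sketch below is only an assumption about what such a procedure could look like, not an existing tool: `RampStep`, `Verdict`, `run_ramp`, and the callback parameters are hypothetical names. Each step names a tranche of projects and a waiting period, and an evaluation callback decides whether to proceed, pause, or roll back.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Sequence


class Verdict(Enum):
    PROCEED = "proceed"
    PAUSE = "pause"
    ROLL_BACK = "roll_back"


@dataclass(frozen=True)
class RampStep:
    projects: Sequence[str]  # one tranche; a single entry means one-by-one
    soak_hours: float        # time to leave before evaluating this step


def run_ramp(
    steps: Sequence[RampStep],
    enable: Callable[[str], None],
    disable: Callable[[str], None],
    wait: Callable[[float], None],
    evaluate: Callable[[List[str]], Verdict],
) -> Verdict:
    """Enable projects step by step, evaluating after each soak period."""
    enabled: List[str] = []
    for step in steps:
        for project in step.projects:
            enable(project)
            enabled.append(project)
        wait(step.soak_hours)
        verdict = evaluate(list(enabled))
        if verdict is Verdict.ROLL_BACK:
            # Disable everything enabled so far, most recent first.
            for project in reversed(enabled):
                disable(project)
            return verdict
        if verdict is Verdict.PAUSE:
            # Leave the current state in place and stop advancing.
            return verdict
    return Verdict.PROCEED
```

In this shape, the list of steps answers "which projects, in which sequence, in what tranches, with how much time between", while `evaluate` is where the criteria for pausing and rolling back would live. What those criteria should be is exactly the open question.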
There has been prior work in this area, e.g. https://arxiv.org/abs/1801.08532 from LinkedIn, but it is geared towards experimental results. We also have to concern ourselves with the heterogeneity of our projects, their environments, users, content, etc., as well as the usual concerns about event rate.