Incident Documentation: An Unexpected Journey

Introduction

The Release Engineering team wants to continually improve the quality of our software. One of the ways we hoped to do that this year is by creating more useful Selenium smoke tests. (From now on, test will be used instead of Selenium test.) This blog post is about how we determined where the tests should focus and their relative priority.

At first, I thought this would be a trivial task. A few hours of work. A few days at most. A week or two if I've completely underestimated it. A couple of months later, I know I have completely underestimated it.

Things I needed to do:

  • Define prioritization scheme.
  • Prioritize target repositories.

Define Prioritization Scheme

In general:

  • Does a repository have stewards? (Do the stewards want tests?)
  • Does a repository have existing tests?

For the last year:

  • How much change happened in a repository? Simply put: more change can lead to more risk.
  • How many incidents is a repository connected to? We wanted to make sure we didn't miss any obvious problematic areas.

Does a Repository Have Stewards?

This was a relatively simple task. The best source of information is the Developers/Maintainers page.

Does a Repository Have Existing Tests?

This was also easy. The Selenium/Node.js page has a list of repositories that have tests in Node.js. I already had all repositories with Node.js and Ruby tests on my machine, so a quick search for webdriverio (Node.js) and mediawiki_selenium (Ruby) found all the tests. To be really sure I had found all repositories with tests, I cloned all repositories from Gerrit.

$ ack --json webdriverio
extensions/Echo/package.json
27:        "webdriverio": "4.12.0"
...
$ ack --type-add=lock:ext:lock --lock mediawiki_selenium
skins/MinervaNeue/Gemfile.lock
42:    mediawiki_selenium (1.7.3)
...

To make extra sure I had not missed any repositories, I used MediaWiki code search (mediawiki_selenium, webdriverio) and GitHub search (org:wikimedia extension:lock mediawiki_selenium, org:wikimedia extension:json webdriverio).

This is the list.

| Repository | Language |
| --- | --- |
| mediawiki/core | JavaScript |
| mediawiki/extensions/AdvancedSearch | JavaScript |
| mediawiki/extensions/CentralAuth | Ruby |
| mediawiki/extensions/CentralNotice | Ruby |
| mediawiki/extensions/CirrusSearch | JavaScript |
| mediawiki/extensions/Cite | JavaScript |
| mediawiki/extensions/Echo | JavaScript |
| mediawiki/extensions/ElectronPdfService | JavaScript |
| mediawiki/extensions/GettingStarted | Ruby |
| mediawiki/extensions/Math | JavaScript |
| mediawiki/extensions/MobileFrontend | Ruby |
| mediawiki/extensions/MultimediaViewer | Ruby |
| mediawiki/extensions/Newsletter | JavaScript |
| mediawiki/extensions/ORES | JavaScript |
| mediawiki/extensions/Popups | JavaScript |
| mediawiki/extensions/QuickSurveys | Ruby |
| mediawiki/extensions/RelatedArticles | JavaScript |
| mediawiki/extensions/RevisionSlider | Ruby |
| mediawiki/extensions/TwoColConflict | JavaScript, Ruby |
| mediawiki/extensions/Wikibase | JavaScript, Ruby |
| mediawiki/extensions/WikibaseLexeme | JavaScript, Ruby |
| mediawiki/extensions/WikimediaEvents | PHP |
| mediawiki/skins/MinervaNeue | Ruby |
| phab-deployment | JavaScript |
| wikimedia/community-tech-tools | Ruby |
| wikimedia/portals/deploy | JavaScript |
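
The local search for repositories with tests could also be scripted. The following is only a sketch, assuming every repository is cloned under one root folder; the function name find_test_repos is hypothetical and this is not the ack invocation used above. It looks for webdriverio in package.json dependencies and for mediawiki_selenium in Gemfile.lock.

```python
import json
from pathlib import Path


def find_test_repos(root):
    """Map each folder under root to the test languages detected in it."""
    root = Path(root)
    found = {}

    # Node.js tests: webdriverio listed as a dependency in package.json.
    for manifest in root.rglob("package.json"):
        if "node_modules" in manifest.parts:
            continue  # skip installed third-party packages
        try:
            data = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or invalid JSON
        if not isinstance(data, dict):
            continue
        deps = {}
        deps.update(data.get("dependencies") or {})
        deps.update(data.get("devDependencies") or {})
        if "webdriverio" in deps:
            key = manifest.parent.relative_to(root).as_posix()
            found.setdefault(key, set()).add("JavaScript")

    # Ruby tests: mediawiki_selenium mentioned in Gemfile.lock.
    for lock in root.rglob("Gemfile.lock"):
        try:
            text = lock.read_text(encoding="utf-8")
        except (OSError, ValueError):
            continue
        if "mediawiki_selenium" in text:
            key = lock.parent.relative_to(root).as_posix()
            found.setdefault(key, set()).add("Ruby")

    return found
```

Calling find_test_repos on the folder that holds the clones returns a dictionary such as {"extensions/Echo": {"JavaScript"}}, keyed by the path relative to that folder.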

How Much Change Happened in a Repository?

After reviewing several tools, I found that we already use Bitergia for various metrics. There is even a nice list of the top 50 repositories by number of commits. The tool also supports limiting the report to a date range. Exactly what I needed.

Bitergia > Last 90 days > Absolute > From 2017-11-01 00:00:00.000 > To 2018-10-31 23:59:59.999 > Go > Git > Overview > Repositories (raw data: P7776, direct link).

This is the top 50 list (excludes empty commits and bots).

| Repository | Commits |
| --- | --- |
| mediawiki/extensions | 11300 |
| operations/puppet | 7988 |
| mediawiki/core | 4590 |
| operations/mediawiki-config | 4005 |
| integration/config | 1652 |
| operations/software/librenms | 1169 |
| pywikibot/core | 927 |
| mediawiki/extensions/Wikibase | 806 |
| apps/android/wikipedia | 789 |
| mediawiki/services/parsoid | 700 |
| mediawiki/extensions/VisualEditor | 692 |
| operations/dns | 653 |
| VisualEditor/VisualEditor | 599 |
| mediawiki/skins | 570 |
| mediawiki/extensions/MobileFrontend | 504 |
| mediawiki/extensions/ContentTranslation | 491 |
| translatewiki | 486 |
| oojs/ui | 469 |
| wikimedia/fundraising/crm | 457 |
| mediawiki/extensions/BlueSpiceFoundation | 414 |
| mediawiki/extensions/CirrusSearch | 357 |
| mediawiki/extensions/AbuseFilter | 306 |
| phabricator/phabricator | 302 |
| mediawiki/services/restbase | 290 |
| mediawiki/extensions/Flow | 232 |
| mediawiki/extensions/Echo | 223 |
| mediawiki/vagrant | 221 |
| mediawiki/extensions/Popups | 184 |
| mediawiki/extensions/Translate | 182 |
| mediawiki/extensions/DonationInterface | 180 |
| analytics/refinery | 178 |
| mediawiki/extensions/PageTriage | 177 |
| mediawiki/extensions/Cargo | 176 |
| mediawiki/tools/codesniffer | 156 |
| mediawiki/extensions/TimedMediaHandler | 152 |
| mediawiki/extensions/UniversalLanguageSelector | 142 |
| mediawiki/vendor | 140 |
| mediawiki/extensions/SocialProfile | 139 |
| analytics/refinery/source | 138 |
| operations/software | 137 |
| mediawiki/services/restbase/deploy | 136 |
| operations/debs/pybal | 123 |
| mediawiki/extensions/CentralAuth | 116 |
| mediawiki/tools/release | 116 |
| mediawiki/services/cxserver | 112 |
| mediawiki/extensions/BlueSpiceExtensions | 110 |
| mediawiki/extensions/WikimediaEvents | 110 |
| labs/private | 108 |
| operations/debs/python-kafka | 104 |
| labs/tools/heritage | 96 |

I got similar results by running git rev-list for all repositories (script, results: P7834).
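
The linked script is not reproduced here. As a rough sketch of the idea, assuming a local clone and counting commits reachable from HEAD, a commit count for the review period could be obtained like this. The function names are hypothetical, and the actual script may use different options.

```python
import subprocess


def rev_list_command(since, until):
    """Build a git command that counts commits on HEAD within a date range."""
    return [
        "git", "rev-list", "--count",
        f"--since={since}", f"--until={until}",
        "HEAD",
    ]


def count_commits(repo_path, since, until):
    """Run the command inside a local clone and return the count as an integer."""
    result = subprocess.run(
        rev_list_command(since, until),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

For the period used in this post, the call would be count_commits(path, "2017-11-01 00:00:00", "2018-10-31 23:59:59"). Unlike the Bitergia report, this count does not exclude bots or empty commits.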

How Many Incidents Is a Repository Connected To?

This proved to be the most time-consuming task.

I started by reviewing existing incident documentation. Take a look at a few incidents. Can you tell which incident report is connected to which repository? I couldn't. (If you can, please let me know. I need your help.)

Incident reports are a wall of text. It was really hard for me to connect an incident report to a repository. An incident report has a title and text, example: 20180724-Train. Text has several sections, including Actionables. Text contains links to Gerrit patches and Phabricator tasks. (From now on, I'll use patches instead of Gerrit patches and tasks instead of Phabricator tasks.)

A patch belongs to a repository. Wikitext [[gerrit:448103]] is patch mediawiki/extensions/Wikibase/+/448103, so the repository is mediawiki/extensions/Wikibase. That is the strongest link between an incident and a repository.
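
To make that link concrete, here is a minimal sketch of the two steps: finding [[gerrit:]] links in wikitext, and reading the repository from a patch path of the form shown above. Both function names are hypothetical. Resolving a change number to its path needs a lookup in Gerrit, which is left out here.

```python
import re

# Matches [[gerrit:448103]] and [[gerrit:448103|label]].
GERRIT_LINK = re.compile(r"\[\[gerrit:(\d+)")


def gerrit_change_numbers(wikitext):
    """Return the change numbers of all [[gerrit:]] links, in order."""
    return [int(number) for number in GERRIT_LINK.findall(wikitext)]


def repository_from_patch_path(path):
    """Return the repository part of 'repository/+/number', or None."""
    repository, separator, _ = path.partition("/+/")
    return repository if separator else None
```

With the example from the text, repository_from_patch_path("mediawiki/extensions/Wikibase/+/448103") gives mediawiki/extensions/Wikibase.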

A task usually has patches associated with it. Wikitext [[phab:T181315]] is task T181315. Gerrit search bug:T181315 finds many connected patches, many of them in operations/puppet and one in mediawiki/vagrant. That is a useful, but not a strong, link between an incident and a repository. Some tasks have several related patches, so this provides a lot of data.
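
The same search can be done through Gerrit's REST API, where a query such as /changes/?q=bug:T181315 returns a JSON list of changes, each with a project field. The response body starts with a )]}' line that has to be removed before parsing. The sketch below only covers parsing and tallying. Fetching is left out, the function names are hypothetical, and any sample data fed to it would be made up rather than the real result for T181315.

```python
import json
from collections import Counter

# Gerrit prepends this line to JSON responses.
MAGIC_PREFIX = ")]}'"


def parse_gerrit_changes(body):
    """Strip the prefix line and decode the list of changes."""
    if body.startswith(MAGIC_PREFIX):
        body = body[len(MAGIC_PREFIX):]
    return json.loads(body)


def repositories_for_changes(changes):
    """Count how many changes belong to each repository (Gerrit project)."""
    return Counter(change["project"] for change in changes)
```

A repository that appears in the result is connected to the task, and through the task to the incident.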

A task also usually has several tags. Most of them are not useful in this context, but tags that are components (and not, for example, milestones or plain tags) could be useful if the component can be linked to a repository. This is also not a strong link between an incident and a repository, and it usually does not provide a lot of data.

In the end, I wrote a tool with an imaginative name, Incident Documentation. The tool currently collects data from patches and tasks in the Actionables section of the incident report. It does not yet collect data from task components. That is tracked as issue #5.
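
This is not the tool's code, only a sketch of the approach it describes: cut the Actionables section out of the wikitext, then collect the [[gerrit:]] and [[phab:]] links inside it. It assumes the section uses standard wiki heading markup (== Actionables ==). The function names are hypothetical.

```python
import re

# A wiki heading line such as "== Actionables ==".
HEADING = re.compile(r"^(=+)[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)
# Matches [[gerrit:123]], [[phab:T123]] and their piped variants.
LINK = re.compile(r"\[\[(gerrit|phab):([^\]|]+)")


def section_text(wikitext, title):
    """Return the text under the heading `title`, up to the next heading
    of the same or a higher level. Returns "" if the heading is missing."""
    headings = list(HEADING.finditer(wikitext))
    for index, match in enumerate(headings):
        if match.group(2).lower() != title.lower():
            continue
        level = len(match.group(1))
        end = len(wikitext)
        for later in headings[index + 1:]:
            if len(later.group(1)) <= level:
                end = later.start()
                break
        return wikitext[match.end():end]
    return ""


def actionable_links(wikitext):
    """Collect patch and task identifiers from the Actionables section."""
    links = {"gerrit": [], "phab": []}
    for prefix, target in LINK.findall(section_text(wikitext, "Actionables")):
        links[prefix].append(target.strip())
    return links
```

Links outside the Actionables section are ignored on purpose, which mirrors the current limitation of the tool.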

Incident Review 2017-11-01 to 2018-10-31

After reviewing the Actionables section of each incident report, along with the related patches and tasks, here are the results. Please note that this table only connects incident reports and repositories. It does not show how many patches from a repository are connected to an incident report. That is tracked as issue #11.

| Repository | Incidents |
| --- | --- |
| operations/puppet | 22 |
| mediawiki/core | 6 |
| operations/mediawiki-config | 4 |
| mediawiki/extensions/Wikibase | 4 |
| wikidata/query/rdf | 2 |
| operations/debs/pybal | 2 |
| mediawiki/extensions/ORES | 2 |
| integration/config | 2 |
| wikidata/query/blazegraph | 1 |
| operations/software | 1 |
| operations/dns | 1 |
| mediawiki/vagrant | 1 |
| mediawiki/tools/release | 1 |
| mediawiki/services/ores/deploy | 1 |
| mediawiki/services/eventstreams | 1 |
| mediawiki/extensions/WikibaseQualityConstraints | 1 |
| mediawiki/extensions/PropertySuggester | 1 |
| mediawiki/extensions/PageTriage | 1 |
| mediawiki/extensions/Cognate | 1 |
| mediawiki/extensions/Babel | 1 |
| maps/tilerator/deploy | 1 |
| maps/kartotherian/deploy | 1 |
| integration/jenkins | 1 |
| eventlogging | 1 |
| analytics/refinery/source | 1 |
| analytics/refinery | 1 |
| All-Projects | 1 |
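
The counting rule behind the table above, where a repository counts once per incident no matter how many of its patches are linked, can be sketched as follows. The input shape and the function name are assumptions for illustration, not the tool's actual data model.

```python
from collections import Counter


def incidents_per_repository(incident_repositories):
    """Count incidents per repository.

    incident_repositories maps an incident title to the repositories of
    all patches linked from it. Duplicates within one incident are
    collapsed, so each incident adds at most 1 to a repository."""
    counts = Counter()
    for repositories in incident_repositories.values():
        counts.update(set(repositories))
    return counts
```

Counting patches instead of incidents (issue #11) would amount to dropping the set() call.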

Selecting Repositories

This table is sorted by the amount of change. The only column that needs explanation is Selected. It shows whether a test makes sense for the repository, taking into account all available data. Repositories without stewards and repositories with existing tests are excluded.

| Repository | Change | Stewards | Coverage | Incidents | Selected |
| --- | --- | --- | --- | --- | --- |
| mediawiki/extensions | 11300 | | | | |
| operations/puppet | 7988 | SRE | | 22 | |
| mediawiki/core | 4590 | Core Platform | JavaScript | 6 | |
| operations/mediawiki-config | 4005 | Release Engineering | | 4 | |
| integration/config | 1652 | Release Engineering | | 2 | |
| operations/software/librenms | 1169 | SRE | | | |
| pywikibot/core | 927 | | | | |
| mediawiki/extensions/Wikibase | 806 | WMDE | JavaScript, Ruby | 4 | |
| apps/android/wikipedia | 789 | | | | |
| mediawiki/services/parsoid | 700 | Parsing | | | |
| mediawiki/extensions/VisualEditor | 692 | Editing | | | yes |
| operations/dns | 653 | SRE | | 1 | |
| VisualEditor/VisualEditor | 599 | Editing | | | |
| mediawiki/skins | 570 | Reading | | | |
| mediawiki/extensions/MobileFrontend | 504 | Reading | Ruby | | |
| mediawiki/extensions/ContentTranslation | 491 | Language engineering | | | yes |
| translatewiki | 486 | | | | |
| oojs/ui | 469 | | | | |
| wikimedia/fundraising/crm | 457 | Fundraising tech | | | |
| mediawiki/extensions/BlueSpiceFoundation | 414 | | | | |
| mediawiki/extensions/CirrusSearch | 357 | Search Platform | JavaScript | | |
| mediawiki/extensions/AbuseFilter | 306 | Contributors | | | yes |
| phabricator/phabricator | 302 | Release Engineering | | | yes |
| mediawiki/services/restbase | 290 | Core Platform | | | |
| mediawiki/extensions/Flow | 232 | Growth | | | yes |
| mediawiki/extensions/Echo | 223 | Growth | JavaScript | | |
| mediawiki/vagrant | 221 | Release Engineering | | 1 | |
| mediawiki/extensions/Popups | 184 | Reading | JavaScript | | |
| mediawiki/extensions/Translate | 182 | Language engineering | | | yes |
| mediawiki/extensions/DonationInterface | 180 | Fundraising tech | | | yes |
| analytics/refinery | 178 | Analytics | | 1 | |
| mediawiki/extensions/PageTriage | 177 | Growth | | 1 | yes |
| mediawiki/extensions/Cargo | 176 | | | | |
| mediawiki/tools/codesniffer | 156 | | | | |
| mediawiki/extensions/TimedMediaHandler | 152 | Reading | | | yes |
| mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | | | yes |
| mediawiki/vendor | 140 | | | | |
| mediawiki/extensions/SocialProfile | 139 | | | | |
| analytics/refinery/source | 138 | Analytics | | 1 | |
| operations/software | 137 | SRE | | 1 | |
| mediawiki/services/restbase/deploy | 136 | Core Platform | | | |
| operations/debs/pybal | 123 | SRE | | 2 | |
| mediawiki/extensions/CentralAuth | 116 | | Ruby | | |
| mediawiki/tools/release | 116 | | | 1 | |
| mediawiki/services/cxserver | 112 | | | | |
| mediawiki/extensions/BlueSpiceExtensions | 110 | | | | |
| mediawiki/extensions/WikimediaEvents | 110 | | PHP | | |
| labs/private | 108 | | | | |
| operations/debs/python-kafka | 104 | SRE | | | |
| labs/tools/heritage | 96 | | | | |
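
The exclusion rule stated before the table (no stewards, or tests already exist) is mechanical and can be sketched as a filter. The Repository class and the candidates function are hypothetical names. What the filter returns is only a list of candidates: the final selection also involved judgment, for example about whether a repository fits the Selenium testing model at all.

```python
from dataclasses import dataclass


@dataclass
class Repository:
    name: str
    change: int
    stewards: str = ""
    coverage: str = ""
    incidents: int = 0


def candidates(repositories):
    """Drop repositories without stewards or with existing tests,
    then sort what is left by the amount of change, largest first."""
    kept = [r for r in repositories if r.stewards and not r.coverage]
    return sorted(kept, key=lambda r: r.change, reverse=True)
```

Given the rows for mediawiki/core (has tests), pywikibot/core (no stewards) and mediawiki/extensions/VisualEditor, only VisualEditor remains.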

Since some of the repositories connected to incidents are not in the top 50 Bitergia report, I've used git rev-list to sort them. Numbers are different because Bitergia excludes empty commits and bots (script, results: P7834).

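| Repository | Change | Stewards | Coverage | Incidents | Selected |
| --- | --- | --- | --- | --- | --- |
| mediawiki/extensions/WikibaseQualityConstraints | 910 | WMDE | | 1 | yes |
| mediawiki/extensions/ORES | 364 | Growth | JavaScript | 2 | |
| wikidata/query/rdf | 204 | WMDE | | 2 | |
| mediawiki/extensions/Babel | 146 | Editing | | 1 | yes |
| mediawiki/services/ores/deploy | 84 | Growth | | 1 | |
| maps/kartotherian/deploy | 80 | | | 1 | |
| mediawiki/extensions/PropertySuggester | 67 | WMDE | | 1 | yes |
| maps/tilerator/deploy | 61 | | | 1 | |
| mediawiki/extensions/Cognate | 47 | WMDE | | 1 | yes |
| All-Projects | 37 | | | 1 | |
| eventlogging | 26 | | | 1 | |
| integration/jenkins | 19 | Release Engineering | | 1 | |
| mediawiki/services/eventstreams | 16 | | | 1 | |
| wikidata/query/blazegraph | 10 | WMDE | | 1 | |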

Prioritize Repositories

The Change column uses Bitergia numbers. Numbers marked with an asterisk (*) are from git rev-list.

| Repository | Change | Stewards | Coverage | Incidents | Selected |
| --- | --- | --- | --- | --- | --- |
| mediawiki/extensions/VisualEditor | 692 | Editing | | | yes |
| mediawiki/extensions/ContentTranslation | 491 | Language engineering | | | yes |
| mediawiki/extensions/AbuseFilter | 306 | Contributors | | | yes |
| phabricator/phabricator | 302 | Release Engineering | | | yes |
| mediawiki/extensions/Flow | 232 | Growth | | | yes |
| mediawiki/extensions/Translate | 182 | Language engineering | | | yes |
| mediawiki/extensions/DonationInterface | 180 | Fundraising tech | | | yes |
| mediawiki/extensions/PageTriage | 177 | Growth | | 1 | yes |
| mediawiki/extensions/TimedMediaHandler | 152 | Reading | | | yes |
| mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | | | yes |
| mediawiki/extensions/WikibaseQualityConstraints | 910* | WMDE | | 1 | yes |
| mediawiki/extensions/Babel | 146* | Editing | | 1 | yes |
| mediawiki/extensions/PropertySuggester | 67* | WMDE | | 1 | yes |
| mediawiki/extensions/Cognate | 47* | WMDE | | 1 | yes |

The same table grouped by stewards.

| Repository | Change | Stewards | Coverage | Incidents | Selected |
| --- | --- | --- | --- | --- | --- |
| mediawiki/extensions/VisualEditor | 692 | Editing | | | yes |
| mediawiki/extensions/Babel | 146* | Editing | | 1 | yes |
| mediawiki/extensions/ContentTranslation | 491 | Language engineering | | | yes |
| mediawiki/extensions/Translate | 182 | Language engineering | | | yes |
| mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | | | yes |
| mediawiki/extensions/AbuseFilter | 306 | Contributors | | | yes |
| phabricator/phabricator | 302 | Release Engineering | | | yes |
| mediawiki/extensions/Flow | 232 | Growth | | | yes |
| mediawiki/extensions/PageTriage | 177 | Growth | | 1 | yes |
| mediawiki/extensions/DonationInterface | 180 | Fundraising tech | | | yes |
| mediawiki/extensions/TimedMediaHandler | 152 | Reading | | | yes |
| mediawiki/extensions/WikibaseQualityConstraints | 910* | WMDE | | 1 | yes |
| mediawiki/extensions/PropertySuggester | 67* | WMDE | | 1 | yes |
| mediawiki/extensions/Cognate | 47* | WMDE | | 1 | yes |
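
The grouped table keeps the steward groups in the order in which they first appear in the prioritized table, and keeps the row order inside each group. A small sketch of that grouping, with a hypothetical function name and rows reduced to three fields:

```python
def group_by_stewards(rows):
    """Group (repository, change, stewards) rows by stewards.

    Groups appear in order of first occurrence and rows keep their
    original order inside each group, because dictionaries preserve
    insertion order."""
    groups = {}
    for repository, change, stewards in rows:
        groups.setdefault(stewards, []).append((repository, change))
    return groups
```

Grouping this way makes it easy to contact each team once with its full list of repositories.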

Conclusions

  • There are some repositories that do not fit the Selenium/end-to-end testing model (e.g. operations/puppet or operations/mediawiki-config) but could benefit from other testing mechanisms or deployment practices.
  • A test could prevent an outage if it runs:
    • Every time a patch is uploaded to Gerrit. That way it could find a problem during development. That is already done for repositories that have tests.
    • After deployment. That way it could find a problem that was not found during development. Ideally, a deployment would first go to a test server in production, and a test would run against that test server. If the test fails, further deployment would be cancelled. This is not yet done.
  • Automattic runs tests targeting WordPress.com production:

We decided to implement some basic e2e test scenarios which would only run in production – both after someone deploys a change and a few times a day to cover situations where someone makes some changes to a server or something.
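
The "after deployment" idea from the list above, where a failing test on the test server cancels the rest of the rollout, can be sketched in a few lines. Everything here is hypothetical: deploy and smoke_test_passes stand in for whatever the real deployment tooling and test runner would provide.

```python
def staged_deploy(deploy, smoke_test_passes, test_server, other_servers):
    """Deploy to a test server first and continue only if the test passes.

    deploy(server) performs the deployment to one server.
    smoke_test_passes(server) runs the test against it and returns a bool.
    Returns True if the rollout completed, False if it was cancelled."""
    deploy(test_server)
    if not smoke_test_passes(test_server):
        return False  # cancel further deployment
    for server in other_servers:
        deploy(server)
    return True
```

The point of the design is that a failing test stops the rollout while only the test server is affected.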

Next steps:

  • I will contact the owners of the selected repositories (see the Prioritize Repositories section) and offer help in creating the first test.
  • I will add results from the Incident Documentation tool to incident reports as a new Related Repositories section. The section will link to the tool and explain how it got the data. It will also ask for edits if the data is not correct.
  • I will reach out to people who created (or edited) incident reports and ask them to populate the Related Repositories section. This might have mixed results. For best results, the section will already be populated with the data from the Incident Documentation tool.
  • I will add a Related Repositories section to the incident report template.

Incident Documentation tool improvements:

  • There are several ways to link from a wiki page to a patch or task. For now, the tool only supports [[gerrit:]] and [[phab:]]. Tracked as issue #6.
  • Gerrit patches and Phabricator tasks from the Actionables section do not provide enough data. The entire incident report should be used. I limited it to Actionables at first because I was collecting data manually (and Actionables looked like the most important part of the incident report), and later because of #6. Tracked as issue #4.
  • Find the Gerrit repository from a task component. Tracked as issue #5.
  • A table with the number of patches from each repository would be helpful. Tracked as issue #11.
  • A report with folder/file names from a repository that are mentioned the most. Especially useful for big repositories like operations/puppet and mediawiki/core. Tracked as issue #12.
Written by zeljkofilipin on Thu, Nov 22, 6:06 PM.
Software Engineer (International contractor)