Detect missing extension dependencies before production
Open, Normal, Public

Description

Actionable from https://wikitech.wikimedia.org/wiki/Incident_documentation/20180724-Train.
Follow-up to T200412: PageTriage requires ORES to be installed.

During the 1.32.0-wmf.14 development cycle, the PageTriage extension declared a dependency on ORES (see https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/PageTriage/+/443856/; d350348d).

This was intentional, and it was thought that all supported contexts (local developer, CI job, Beta Cluster wikis, prod wikis) would have both extensions installed.

However, it was discovered after deployment that test2.wikipedia.org was broken because it did not have ORES installed.

This task is about evaluating ways this could have been caught before the deployment.

Krinkle created this task. Aug 1 2018, 9:41 PM

Some initial ideas I discarded for myself:

  • CI for PageTriage repository: This seems hard, given that the job either has ORES installed or does not, and to test PageTriage's ORES features we do install it. Any declaration made for the test would only verify itself and would not help catch cases where it mismatches production.
  • Beta Cluster: It may be possible to catch this sort of issue in Beta, and it seems like we already would have, but we don't have a Beta wiki for every prod wiki, and if we did, most of them would likely not have enough traffic for anyone to notice that one of the many is broken.

Some other ideas that might work?

  • CI for wmf-config repository: It might be possible to detect this in CI for wmf-config. Specifically, the commit enabling the branch could maybe somehow statically detect which extensions are enabled on a wiki, and what their dependencies are and fail the job if dependencies aren't met.
  • Deployment (Scap/canary):
    • Endpoint checks: These checks should be run against at least one wiki/hostname in each deployment group (group0-2). This is currently limited to one run, per mediawiki_canary_swagger_url (en.wikipedia).
    • Logstash checks: The Logstash query would already have found the ExtensionDependencyError exception, but presumably did not trigger because of limited traffic to test2wiki.
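
As a rough illustration of the endpoint-check idea above, the following Python sketch picks at least one hostname per deployment group and runs a check against each. The function names, the `check` callable, and the group-to-hostname mapping are all hypothetical; nothing here reflects how Scap actually implements its canary checks.

```python
def pick_canary_hosts(hosts_by_group, per_group=1):
    """Pick the first `per_group` hostnames from every deployment group."""
    targets = []
    for group in sorted(hosts_by_group):
        hosts = hosts_by_group[group]
        if not hosts:
            raise ValueError(f"deployment group {group!r} has no hosts to check")
        targets.extend((group, host) for host in hosts[:per_group])
    return targets


def run_endpoint_checks(targets, check):
    """Run `check(host)` for each (group, host) target and collect failures."""
    failures = []
    for group, host in targets:
        try:
            ok = check(host)
        except Exception as exc:  # a crashing check counts as a failure
            failures.append((group, host, repr(exc)))
            continue
        if not ok:
            failures.append((group, host, "check returned False"))
    return failures


# Illustrative data only: the group assignments are assumptions.
hosts_by_group = {
    "group0": ["test2.wikipedia.org"],
    "group1": ["group1-wiki.example.org"],
    "group2": ["en.wikipedia.org"],
}
targets = pick_canary_hosts(hosts_by_group)
```

The `check` argument is left as a plain callable so that whatever endpoint check already exists could be plugged in; a deployment would be blocked if `run_endpoint_checks` returns a non-empty list.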

Krinkle edited projects, added Scap; removed Deployments.
thcipriani triaged this task as Normal priority.

CI for wmf-config repository: It might be possible to detect this in CI for wmf-config. Specifically, the commit enabling the branch could maybe somehow statically detect which extensions are enabled on a wiki, and what their dependencies are and fail the job if dependencies aren't met.

+1 for this. It's not a small effort to add this but seems like it would be really useful to avoid these types of problems in the future.
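
A minimal sketch of what such a wmf-config check might look like, in Python. It assumes each extension ships an extension.json whose `requires.extensions` key lists the extensions it depends on, and that the set of extensions enabled per wiki has already been computed somehow; how to derive that set from wmf-config is the hard part and is not shown. The function names and the `enabled_by_wiki` mapping are hypothetical, and version constraints are ignored.

```python
import json
from pathlib import Path


def load_requirements(extensions_dir):
    """Map each extension name to the set of extension names it requires."""
    requirements = {}
    for manifest in Path(extensions_dir).glob("*/extension.json"):
        data = json.loads(manifest.read_text(encoding="utf-8"))
        name = data.get("name", manifest.parent.name)
        # Only the names are used here; version constraints are ignored.
        requirements[name] = set(data.get("requires", {}).get("extensions", {}))
    return requirements


def find_missing_dependencies(enabled_by_wiki, requirements):
    """Return {wiki: {extension: [missing dependencies]}} for unmet ones."""
    problems = {}
    for wiki, enabled in enabled_by_wiki.items():
        for ext in sorted(enabled):
            missing = requirements.get(ext, set()) - set(enabled)
            if missing:
                problems.setdefault(wiki, {})[ext] = sorted(missing)
    return problems


# Illustrative input modelled on this incident (hypothetical data).
requirements = {"PageTriage": {"ORES"}, "ORES": set()}
enabled_by_wiki = {
    "enwiki": {"PageTriage", "ORES"},
    "test2wiki": {"PageTriage"},
}
problems = find_missing_dependencies(enabled_by_wiki, requirements)
# problems == {"test2wiki": {"PageTriage": ["ORES"]}}
```

A CI job could fail whenever `problems` is non-empty, printing the affected wiki and the missing extension.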