
Add alert for app servers in prod serving outdated MediaWiki branches
Closed, Resolved (Public)

Description

Follow-up from incident T241251.

Proposal: An Icinga alert of some sort that fires if any server in a production app server cluster is serving a MediaWiki version other than the current version(s), as defined by the deployment server or a mwmaint server (e.g. noc.wm.o).

This is meant to catch a wide range of possible failure scenarios such as:

  • a server missing from dsh.
  • scap syncs failing consistently for a prolonged period of time, to the point that the server is more than a week behind.

Why: It is difficult to reason about the integrity and security of production if an app server could be significantly behind. In particular, if a server is able to talk to one or more shared services like Memcached, session store, job queue, external store, Swift, or Graphite, then a server out there with an outdated copy of MediaWiki could behave in ways developers do not account for.

This is because, for internally breaking changes, we generally keep compatibility only until the branch is fully deployed and the next one starts. After that, we assume production has reached that state and will never go back more than 2 versions. Violating this assumption could cause corruption or other damage.

Event Timeline

I think first we would look at T218412 (What even is a MediaWiki version?). My suggestion to determine one was:

sha1sum /srv/mediawiki/php/cache/gitinfo/* | sha1sum

to get a checksum of checksums so that MediaWiki and extensions together have a single version.
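For illustration, the same checksum-of-checksums idea can be sketched in Python. This is only a sketch: the function names are made up here, and the result may not be byte-identical to the shell pipeline, since path sorting and sha1sum's escaping of unusual filenames can differ.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path) -> str:
    """Return the hex SHA-1 digest of one file's contents."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def combined_checksum(paths) -> str:
    """Hash the per-file digests into one value, like `sha1sum FILES | sha1sum`.

    Paths are sorted so the result does not depend on listing order.
    """
    outer = hashlib.sha1()
    for path in sorted(Path(p) for p in paths):
        # Same line layout sha1sum prints for ordinary names: "<digest>  <path>\n"
        outer.update(f"{sha1_of_file(path)}  {path}\n".encode())
    return outer.hexdigest()


if __name__ == "__main__":
    # Directory taken from the command above.
    files = Path("/srv/mediawiki/php/cache/gitinfo").glob("*")
    print(combined_checksum(f for f in files if f.is_file()))
```

Any change to any gitinfo file changes the combined value, which is what gives MediaWiki core and its extensions a single comparable version.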

So I would imagine that for a check like this we run that command on deployment_server and compare it to the result of the same command on all app servers.

Seems pretty doable with Icinga. Deployment_server would get some puppet code to export the value to a text file on the webserver; on the app servers, an Icinga plugin fetches that file and compares it to the local result.
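A rough sketch of the comparison step in such a plugin, using the standard Nagios/Icinga exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names are hypothetical. How the expected value is fetched from the deployment server is left out, since the location of that text file is not defined here.

```python
import sys

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def compare_versions(expected, local):
    """Return (exit_code, message) for an expected vs. local version string."""
    if not expected or not local:
        return UNKNOWN, "UNKNOWN: could not determine expected or local version"
    expected, local = expected.strip(), local.strip()
    if expected == local:
        return OK, f"OK: local version matches {expected}"
    return CRITICAL, f"CRITICAL: local version {local} differs from expected {expected}"


def main(expected, local):
    """Print the status line and exit with the matching plugin code."""
    code, message = compare_versions(expected, local)
    print(message)
    sys.exit(code)
```

Returning UNKNOWN when either value is missing keeps a fetch failure from being reported as a version mismatch.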

Joe changed the task status from Open to Stalled. Jan 7 2020, 6:11 AM
Joe subscribed.

This isn't going to happen until some effort is put into making scap's management of data saner.

An alert can't be created right now, and there is nothing in our current deployment tool that allows us to do correctly what is being proposed.

Enforcing full integrity and equality of the /srv/mediawiki directory would be awesome but that's imho an incremental improvement to consider after T218412 is in place.

Using the branch name could already start catching the worst cases. E.g. right now all app servers must be on 1.35.0-wmf.11. This information is readily available on the deployment host and/or via noc.wm.o and could be asserted in a fairly simple way from a shell script based on the local wikiversions.json file.

Do you think that would be useful?
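A minimal sketch of the branch-name comparison, assuming wikiversions.json maps each wiki database name to a value like "php-1.35.0-wmf.11". That layout, the function names and the wiki names in the usage note are assumptions for illustration.

```python
import json


def load_wikiversions(path):
    """Read a wikiversions.json file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def active_branches(wikiversions):
    """Return the set of branch names in a wikiversions mapping.

    Assumes values look like "php-1.35.0-wmf.11"; the "php-" prefix is dropped.
    """
    prefix = "php-"
    return {
        value[len(prefix):] if value.startswith(prefix) else value
        for value in wikiversions.values()
    }


def unexpected_branches(local, expected):
    """Branches served locally that the deployment host does not list."""
    return active_branches(local) - active_branches(expected)
```

For example, with expected = {"enwiki": "php-1.35.0-wmf.11"} and a stale local copy {"enwiki": "php-1.35.0-wmf.8"}, unexpected_branches(local, expected) contains "1.35.0-wmf.8", and the check would go critical.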

Change 566708 had a related patch set uploaded (by Hnowlan; owner: Hugh Nowlan):
[operations/puppet@production] mediawiki: check mw versions match those on the deploy server

https://gerrit.wikimedia.org/r/566708

Main blocker on this alert being useful is detecting when a scap deployment is currently underway so as to avoid false criticals. Not sure how best to do this - one simpler way would be checking Last-Modified on the upstream file and not alerting within a window of however long a deploy takes plus some additional wait factor. Is it likely that there would be an update on the deploy server but not on the app servers for a long period of time?

Change 566708 merged by Hnowlan:
[operations/puppet@production] mediawiki: check mw versions match those on the deploy server

https://gerrit.wikimedia.org/r/566708

Is it likely that there would be an update on the deploy server but not on the app servers for a long period of time?

30 minutes seems like the right amount of time. Kudos for implementing this.
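For reference, the Last-Modified grace window discussed above could look roughly like this, using the 30-minute figure. It is a sketch with hypothetical names, not necessarily how the merged check works.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Window suggested in the discussion; treat it as tunable.
GRACE = timedelta(minutes=30)


def within_grace(last_modified, now=None, grace=GRACE):
    """True if the upstream file changed less than `grace` ago.

    `last_modified` is the HTTP Last-Modified header value.
    """
    changed = parsedate_to_datetime(last_modified)
    if changed.tzinfo is None:
        # Treat a timestamp without a usable zone as UTC.
        changed = changed.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - changed < grace


def should_alert(versions_differ, last_modified, now=None):
    """Alert only on a mismatch that has outlived the grace window."""
    return versions_differ and not within_grace(last_modified, now)
```

A mismatch seen 10 minutes after the upstream file changed is suppressed as a likely in-progress deploy; the same mismatch an hour later alerts.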

Change 571300 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/puppet@production] mediawiki: Correct pathing issues for version monitoring script

https://gerrit.wikimedia.org/r/571300

Change 571300 merged by Hnowlan:
[operations/puppet@production] mediawiki: Correct pathing issues for version monitoring script

https://gerrit.wikimedia.org/r/571300

This has been rolled out to all profile::mediawiki hosts as check_mw_wikiversion_difference in NRPE and as "Ensure local MW versions match expected deployment" in Icinga.