Maniphest T242023

Add alert for app servers in prod serving outdated MediaWiki branches
Closed, Resolved (Public)

Assigned To: hnowlan

Authored By: Krinkle, Jan 6 2020, 7:57 PM

Description

Follow-up from incident T241251.

Proposal: An Icinga alert of some sort that fires if any server in a production app server cluster is serving a MediaWiki version other than one of the current version(s), as defined by the deployment server or a mwmaint server (e.g. noc.wm.o).

This is meant to catch a wide range of possible failure scenarios such as:

A server missing from dsh.
Scap syncs failing consistently for a prolonged period of time, to the point that a server is more than a week behind.

Why: It is difficult to reason about the integrity and security of production if an app server could be significantly behind. In particular, if a server can talk to one or more shared services (Memcached, session store, job queue, external store, Swift, or Graphite), then a server running an outdated copy of MediaWiki could behave in ways developers do not account for.

This is because we generally assume that, for internally breaking changes, we only need to keep compatibility until the branch is fully deployed and the next one starts. After that, we may assume the state to be at that point and never go back more than 2 versions. Violating this assumption could cause corruption or other damage.

Details

	Subject	Repo	Branch	Lines +/-
	mediawiki: Correct pathing issues for version monitoring script	operations/puppet	production	+1 -1
	mediawiki: check mw versions match those on the deploy server	operations/puppet	production	+208 -0


Related Objects

Status	Assigned	Task
Open	None	T213156 SRE FY2019 Q3:TEC6: First steps towards Canary Deployments
Open	None	T210143 Canaries canaries canaries
Open	None	T209881 Introduce state to Scap
Open	None	T218412 Define a mediawiki "version"
Resolved	hnowlan	T242023 Add alert for app servers in prod serving outdated MediaWiki branches
Resolved	fgiunchedi	T251942 Aggregate check_mw_versions alerts for each individual app server

Event Timeline

Krinkle created this task. Jan 6 2020, 7:57 PM

Restricted Application added a subscriber: Aklapper. Jan 6 2020, 7:57 PM

Krinkle moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Jan 6 2020, 8:14 PM

CDanis subscribed. Jan 6 2020, 8:20 PM

Dzahn subscribed. Jan 6 2020, 8:59 PM

I think first we would look at T218412 (What even is a MediaWiki version?). My suggestion to determine one was:

sha1sum /srv/mediawiki/php/cache/gitinfo/* | sha1sum

to get a checksum of checksums so that MediaWiki and extensions together have a single version.

So I would imagine that, for a check like this, we run that command on deployment_server and compare the result with that of the same command on all app servers.

Seems pretty doable with Icinga. Deployment_server would get some puppet code to export the value to a text file on the webserver; on the app servers, an Icinga plugin fetches that file and compares it to the local result.
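As a rough illustration of the "checksum of checksums" comparison suggested above, here is a minimal Python sketch. The function names (`file_sha1`, `tree_fingerprint`, `matches_expected`) are hypothetical and not part of any existing tooling. Note that it hashes file names rather than the full paths that `sha1sum` prints, so its output is not interchangeable with the shell pipeline; both sides of the comparison would need to use the same method.

```python
import hashlib
from pathlib import Path


def file_sha1(path: Path) -> str:
    """Return the hex SHA-1 of one file's contents."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def tree_fingerprint(directory: Path) -> str:
    """Hash the per-file hashes of every regular file in `directory`.

    Files are visited in sorted name order so the result is stable
    across hosts. Each line has the shape "<hash>  <file name>".
    """
    lines = [
        f"{file_sha1(p)}  {p.name}\n"
        for p in sorted(directory.iterdir())
        if p.is_file()
    ]
    return hashlib.sha1("".join(lines).encode("utf-8")).hexdigest()


def matches_expected(directory: Path, expected: str) -> bool:
    """Compare the local fingerprint with the value exported elsewhere.

    `expected` stands in for the text file content fetched from the
    deployment server (an assumption; trailing whitespace is ignored).
    """
    return tree_fingerprint(directory) == expected.strip()
```

On an app server, `directory` would presumably point at the gitinfo cache directory from the command above, and `expected` would be the text fetched from the deployment server.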

Dzahn added a project: observability. Jan 6 2020, 9:09 PM

Dzahn added a parent task: T218412: Define a mediawiki "version".

This isn't going to happen until some effort is put into making scap's management of data saner.

An alert can't be created right now, and nothing in our current deployment tool allows us to do what is being proposed correctly.

Enforcing full integrity and equality of the /srv/mediawiki directory would be awesome but that's imho an incremental improvement to consider after T218412 is in place.

Using the branch name could already start catching the worst cases. E.g. right now all app servers must be on 1.35.0-wmf.11. This information is readily available on the deployment host and/or via noc.wm.o and could be asserted in a fairly simple way from a shell script based on the local wikiversions.json file.

Do you think that would be useful?
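The branch-name assertion described above could look roughly like the following Python sketch. It assumes that wikiversions.json is a flat JSON object mapping each wiki database name to a branch string such as `php-1.35.0-wmf.11`; that layout, the function names, and the sample wiki names and branch numbers other than wmf.11 are illustrative assumptions.

```python
import json
from typing import Set


def branches_in_use(wikiversions_json: str) -> Set[str]:
    """Return the distinct branch values in a wikiversions.json document.

    Assumes a flat object of the form {"<dbname>": "<branch>", ...}.
    """
    data = json.loads(wikiversions_json)
    return set(data.values())


def unexpected_branches(local_json: str, expected_json: str) -> Set[str]:
    """Branches referenced locally that the reference copy does not list."""
    return branches_in_use(local_json) - branches_in_use(expected_json)


# Hypothetical sample data: the reference copy from the deployment host
# and a stale local copy found on an app server.
expected = '{"enwiki": "php-1.35.0-wmf.11", "testwiki": "php-1.35.0-wmf.14"}'
stale = '{"enwiki": "php-1.35.0-wmf.5", "testwiki": "php-1.35.0-wmf.14"}'

print(sorted(unexpected_branches(stale, expected)))
```

A non-empty result would mean the host references at least one branch that the deployment host no longer considers current.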

WDoranWMF assigned this task to hnowlan. Jan 21 2020, 2:26 PM

Change 566708 had a related patch set uploaded (by Hnowlan; owner: Hugh Nowlan):
[operations/puppet@production] mediawiki: check mw versions match those on the deploy server

https://gerrit.wikimedia.org/r/566708

gerritbot added a project: Patch-For-Review. Jan 23 2020, 11:30 AM

The main blocker to this alert being useful is detecting when a scap deployment is underway, so as to avoid false criticals. I'm not sure how best to do this. One simple way would be to check Last-Modified on the upstream file and not alert within a window of however long a deploy takes plus some additional wait factor. Is it likely that there would be an update on the deploy server but not on the app servers for a long period of time?
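The Last-Modified idea could be sketched as follows in Python. This is not the merged script; `check_status`, its parameters, and the default grace period are assumptions made for illustration. It returns the conventional Nagios/Icinga plugin exit codes (0 for OK, 2 for CRITICAL) and stays quiet while the upstream file has changed more recently than the grace window.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

OK, CRITICAL = 0, 2  # conventional Nagios/Icinga plugin exit codes


def check_status(
    versions_match: bool,
    last_modified_header: str,
    now: datetime,
    grace: timedelta = timedelta(minutes=30),  # placeholder default
) -> int:
    """Decide the plugin status for one app server.

    `last_modified_header` is the HTTP Last-Modified value of the
    upstream versions file; `now` must be timezone-aware.
    """
    if versions_match:
        return OK
    changed_at = parsedate_to_datetime(last_modified_header)
    # A mismatch right after an upstream change is probably a deploy
    # still in progress, so it is not reported yet.
    if now - changed_at < grace:
        return OK
    return CRITICAL


# Hypothetical timestamps for illustration.
header = "Thu, 23 Jan 2020 11:30:00 GMT"
changed = datetime(2020, 1, 23, 11, 30, tzinfo=timezone.utc)
print(check_status(False, header, changed + timedelta(minutes=10)))
print(check_status(False, header, changed + timedelta(minutes=45)))
```

The trade-off is that a genuinely stale server is reported only once the window has elapsed, so `grace` should cover a typical deploy plus some margin.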

Change 566708 merged by Hnowlan:
[operations/puppet@production] mediawiki: check mw versions match those on the deploy server

https://gerrit.wikimedia.org/r/566708

mmodell awarded a token. Feb 10 2020, 3:05 PM

Maintenance_bot removed a project: Patch-For-Review. Feb 10 2020, 3:11 PM

In T242023#5842303, @hnowlan wrote:

Is it likely that there would be an update on the deploy server but not on the app servers for a long period of time?

30 minutes seems like the right amount of time. Kudos for implementing this.

Change 571300 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/puppet@production] mediawiki: Correct pathing issues for version monitoring script

https://gerrit.wikimedia.org/r/571300

gerritbot added a project: Patch-For-Review. Feb 10 2020, 3:21 PM

Change 571300 merged by Hnowlan:
[operations/puppet@production] mediawiki: Correct pathing issues for version monitoring script

https://gerrit.wikimedia.org/r/571300

Maintenance_bot removed a project: Patch-For-Review. Feb 10 2020, 4:10 PM

This has been rolled out to all profile::mediawiki hosts as check_mw_wikiversion_difference in nrpe and as "Ensure local MW versions match expected deployment" in Icinga.

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

colewhite added a subtask: T251942: Aggregate check_mw_versions alerts for each individual app server. May 5 2020, 7:20 PM

fgiunchedi closed subtask T251942: Aggregate check_mw_versions alerts for each individual app server as Resolved. Aug 21 2024, 9:37 AM

fgiunchedi mentioned this in T374860: Retire mw_wikiversion_difference check. Sep 16 2024, 3:03 PM
