
Assess switchover behavior for mw-wikifunctions
Closed, Resolved (Public)

Description

In T384944, mw-wikifunctions migrated to using k8s ingress. Among other things, that means it no longer has its own discovery services - i.e., mw-wikifunctions.discovery.wmnet and mw-wikifunctions-ro.discovery.wmnet are really just CNAMEs for the respective k8s-ingress-wikikube services [0].

At the very least, the MEDIAWIKI_SERVICES and MEDIAWIKI_RO_SERVICES lists in [1] need to be updated to remove these not-actually-a-discovery-service services. If we're really careful about it, that can happen once [2] is merged (ideally not before, given how some of the filtering works in the sre.discovery.datacenter cookbook).
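As a rough illustration of the list cleanup described above, here is a minimal sketch. The list contents and the `INGRESS_BACKED` set and `without_ingress_backed` helper are hypothetical; only the two list names and the two mw-wikifunctions service names come from this task. The real cookbook may structure this differently.

```python
# Hypothetical list contents for illustration only; the real lists live in
# the switchdc mediawiki cookbook and may contain different entries.
MEDIAWIKI_SERVICES = ["mw-web", "mw-api-ext", "mw-wikifunctions"]
MEDIAWIKI_RO_SERVICES = ["mw-web-ro", "mw-api-ext-ro", "mw-wikifunctions-ro"]

# Names that are now plain CNAMEs for k8s-ingress-wikikube services and
# therefore no longer behave like standalone discovery services.
# (Hypothetical helper constant, not part of the cookbook.)
INGRESS_BACKED = {"mw-wikifunctions", "mw-wikifunctions-ro"}


def without_ingress_backed(services, ingress_backed=INGRESS_BACKED):
    """Return the services that still have their own discovery record."""
    return [name for name in services if name not in ingress_backed]


if __name__ == "__main__":
    print(without_ingress_backed(MEDIAWIKI_SERVICES))
    print(without_ingress_backed(MEDIAWIKI_RO_SERVICES))
```

In practice the simpler fix is to delete the two entries from the lists directly; the filter form just makes the intent explicit.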

However, there's kind of a broader question here:

We now have a mediawiki instance whose active-passive traffic moves during the Day 1 services switchover (i.e., with k8s-ingress-wikikube-rw), rather than coordinated with the database primary switchover on Day 2.

Having primary database ops going cross-DC for 24h "should be fine" nowadays given the rewriting we do in the secondary DC in mediawiki config [3]. In fact, we had that happen with mw-videoscaler during the last switchover, since transcode processing stays local now: with eventgate being depooled in the primary DC on Day 1, the jobs would be enqueued and processed in the then-secondary DC (see T372849#10653971).
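To make the timing explicit, the following sketch models the three phases of the switchover and flags the one where mw-wikifunctions read-write traffic and the database primaries sit in different datacenters. The `Phase` type, the phase names, and the placeholder datacenter labels are assumptions made for illustration; they are not taken from any cookbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    ingress_rw_dc: str  # where k8s-ingress-wikikube-rw traffic (incl. mw-wikifunctions) lands
    db_primary_dc: str  # where the database primaries live


# Placeholder labels, not real datacenter names.
OLD, NEW = "old-primary-dc", "new-primary-dc"

PHASES = [
    Phase("before Day 1", OLD, OLD),
    Phase("Day 1 to Day 2", NEW, OLD),  # services moved, databases not yet
    Phase("after Day 2", NEW, NEW),
]


def cross_dc_phases(phases):
    """Names of phases in which primary DB queries cross datacenters."""
    return [p.name for p in phases if p.ingress_rw_dc != p.db_primary_dc]


if __name__ == "__main__":
    print(cross_dc_phases(PHASES))
```

Only the middle phase is cross-DC, which corresponds to the roughly 24h window discussed here.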

In any case, we should probably document the desired behavior somewhere to make this explicit.

[0] https://gerrit.wikimedia.org/r/c/operations/dns/+/1133878

[1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/__init__.py

[2] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1163856

[3] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/src/etcd.php

Event Timeline

Scott_French renamed this task from Update switchover behavior for mw-wikifunctions to Assess switchover behavior for mw-wikifunctions. Jun 25 2025, 8:13 PM

Change #1184125 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/cookbooks@master] switchdc: remove mw-wikifunctions discovery services following move to k8s ingress

https://gerrit.wikimedia.org/r/1184125

Change #1184125 merged by jenkins-bot:

[operations/cookbooks@master] switchdc: remove mw-wikifunctions discovery services following move to k8s ingress

https://gerrit.wikimedia.org/r/1184125

Scott_French edited projects, added ServiceOps new; removed serviceops-deprecated.
Scott_French moved this task from Inbox to Scheduled (this Q) on the ServiceOps new board.

Speculatively moving this to Scheduled, since it would be good to make the respective documentation changes prior to the upcoming switchover.

@jasmine_ - Do you think you might have a chance to do that as part of the Wikitech updates you're already working on? Basically, the point to document is that all (not just RO) mw-wikifunctions traffic switches on Day 1 with the other Ingress services, which means there's a period of ~24h where primary DB queries are cross-DC, and empirically we've found this seems to be "fine" (at least in this context).

Thanks Scott~ The switchover docs have been updated to note the change. I've noted it under [0]; feel free to let me know if it should be more or less visible in the documentation.

[0] - https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&wvprov=sticky-header#High_Level_switchover_flow