
make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names)
Closed, Resolved (Public)

Description

In each main DC we currently have exactly one MediaWiki maintenance (mwmaint*) server, but ideally we want to be able to have more than one per DC for migrations.

One definition of the single "active" server is based on where the actual mw maint "crons" (now systemd timers) are running.

Then there is another one based on which webserver is currently hosting noc.wikimedia.org.

The latter already tries to avoid hardcoded host names by using a discovery record https://mwmaint.discovery.wmnet,
but the former depends on the active data center setting in the mw-config repo.

So we had a situation where one was switched and not the other.

This ticket is to make sure they switch together in one way or another, and also to add the usual Puppet code that warns people when they are on a non-active server.
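The mismatch described above (timers switched, noc not, or the other way round) could be caught by a small check. The following is only a minimal sketch and not existing tooling: it assumes host names of the form `<name>.<dc>.wmnet`, and the function names and the example host names in the usage note are hypothetical.

```python
def dc_of(fqdn: str) -> str:
    """Return the data center label from an FQDN such as 'host.eqiad.wmnet'.

    Assumption: the second-to-last DNS label names the data center.
    """
    parts = fqdn.split(".")
    if len(parts) < 3:
        raise ValueError(f"cannot derive a data center from {fqdn!r}")
    return parts[-2]


def switched_together(timer_host: str, noc_backend: str) -> bool:
    """True when the host running the maintenance timers and the host
    serving noc.wikimedia.org are in the same data center."""
    return dc_of(timer_host) == dc_of(noc_backend)
```

For example, `switched_together("maint-a.codfw.wmnet", "maint-b.eqiad.wmnet")` would return `False`, which is the half-switched state this ticket wants to avoid.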

Event Timeline

The second part, having the inactive warning in the MOTD, is already done, as I see now that I am looking at it again:

    # T199124
    $motd_ensure = $ensure ? {
        'present' => 'absent',
        'absent'  => 'present',
        default   => 'present',
    }

    motd::script { 'inactive_warning':
        ensure   => $motd_ensure,
        priority => 1,
        content  => template('profile/mediawiki/maintenance/inactive.motd.erb'),
    }

After revisiting this today I think it can be split into 3 separate parts (cc: @RLazarus @Joe):

a) allow multiple maintenance servers per DC without enabling jobs on more than one

This is basically a duplicate of T266717, which already has discussion, so I am not going to continue that part on this ticket. For the upcoming buster upgrade (T267607) this won't be needed, because we decided to upgrade in place with minimal downtime instead of installing new hardware or VMs in parallel as we often do for other things. Note, though, that it looks like we are also going to get new hardware for the one in codfw (T271346).

b) make sure there is an MOTD warning on (all) inactive servers

This works currently, as long as there isn't more than 1 server per DC, so as above it isn't immediately needed if we avoid a parallel setup. Ideally, though, in the future this would not rely only on the active DC, while also avoiding another place where we define an "active" server. Instead it should depend on the same mechanism that is discussed in T266717, so that when jobs are enabled via conftool, that also affects the MOTD.
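To illustrate the idea of deriving the MOTD warning from the same per-host "jobs enabled" state, here is a minimal sketch. It does not use any real conftool API: the state is modeled as a plain dictionary, and the function names and host names are hypothetical. The mapping mirrors the Puppet selector quoted earlier (jobs enabled means no warning).

```python
from __future__ import annotations


def motd_warning_ensure(jobs_enabled: bool) -> str:
    """Inactive warning is absent on the host running jobs, present elsewhere."""
    return "absent" if jobs_enabled else "present"


def motd_states(jobs_enabled_by_host: dict[str, bool]) -> dict[str, str]:
    """Derive the warning state for every host from one source of truth,
    so there is no second place that defines which server is 'active'."""
    return {
        host: motd_warning_ensure(enabled)
        for host, enabled in jobs_enabled_by_host.items()
    }
```

With two servers in one DC and jobs enabled on only one of them, only the other server would show the warning, regardless of which DC is active.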

c) avoid needing to make a DNS change to switch where the noc.wikimedia.org site is hosted.

We already use a discovery name but it's one of the "misc services with multiple backends but without geodns". So the solution would be to add geoDNS for noc.wikimedia.org and make it active/active.
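As a rough model of what active/active with geoDNS would mean for clients, consider the sketch below. It is a simplification under stated assumptions, not the actual DNS configuration: `pick_backend`, the `pooled` set and the `fallback_order` list are hypothetical names, and real geoDNS maps client location to a data center in a more involved way.

```python
def pick_backend(client_dc: str, pooled: set, fallback_order: list) -> str:
    """Return the data center a client would be sent to.

    A client is served by its nearest DC when that DC is pooled;
    otherwise it falls back to the first pooled DC in fallback_order.
    """
    if client_dc in pooled:
        return client_dc
    for dc in fallback_order:
        if dc in pooled:
            return dc
    raise LookupError("no pooled data center available")
```

In this model, moving traffic away from one DC is a change to the pooled set rather than an edit to a DNS record, which is the point of part c).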

Conclusion: a) and b) are duplicates and should continue on that ticket, and I should probably rename this one to just be about geoDNS for noc.wm.org.

Dzahn renamed this task from improve mw maintenance server switch over and discovery names to make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names). Jan 8 2021, 9:48 PM

Change 655168 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add discovery-geo-resources for noc

https://gerrit.wikimedia.org/r/655168

Change 655168 abandoned by Dzahn:

[operations/dns@master] add discovery-geo-resources for noc

Reason:

https://gerrit.wikimedia.org/r/655168

Change 655168 restored by Dzahn:

[operations/dns@master] add discovery-geo-resources for noc

https://gerrit.wikimedia.org/r/655168

Change 655168 abandoned by Dzahn:

[operations/dns@master] add discovery-geo-resources for noc

Reason:

This should also move to k8s and behind ingress now that we have one, which should make this obsolete.

https://gerrit.wikimedia.org/r/655168

Dzahn removed Dzahn as the assignee of this task. Feb 22 2023, 6:45 PM

Removing assignee based on automated mail from Andre pointing out it has been assigned for more than 2 years.

Krinkle assigned this task to Joe.
Krinkle added a subscriber: Krinkle.

Presumed fixed by T341859: Move noc.wikimedia.org to kubernetes.

In particular: