
upgrade mwmaint servers to buster
Open, Medium, Public

Description

mwmaint1002 is the server where we run scheduled tasks, which makes it a very special host. We need to come up with a plan for how this host will be upgraded, or whether it will be replaced by a new one, e.g. mwmaint1003.

The server should not be migrated or replaced before Jan 2021

edit: now using this ticket for both mwmaint servers, also mwmaint2001

Event Timeline

jijiki triaged this task as Medium priority. Nov 10 2020, 3:59 PM

I could take this one (later). Have done mwmaint upgrade in the past. I would ideally like to create mwmaint1003 and eventually flip over.

You're probably already thinking about this, but just to make sure it's said out loud: mwmaint1002 is still running updateCollation for the ICU upgrade, and will be chewing through enwiki for some days, so best to leave mwmaint1002 in place until that's finished. (T264991)

We should have two mwmaint servers per DC anyway (with some mechanism to flip the active one), some failover capability is needed outside of OS updates as well (reboots e.g. are a total pain with the current SPOF setup that we have)

You're probably already thinking about this, but just to make sure it's said out loud: mwmaint1002 is still running updateCollation for the ICU upgrade, and will be chewing through enwiki for some days, so best to leave mwmaint1002 in place until that's finished. (T264991)

Yes, definitely not planning to touch the existing server. I was hoping to get new hardware to install in parallel.

We should have two mwmaint servers per DC anyway (with some mechanism to flip the active one), some failover capability is needed outside of OS updates as well (reboots e.g. are a total pain with the current SPOF setup that we have)

ACK, I might take a look at improving puppet code to allow switching between multiple servers per DC.

I unintentionally created some confusion I think, and I am very sorry. I have updated the description to reflect that our target for this quarter is to have done as much preliminary work as we can regarding the upgrades of any mediawiki servers.

Dzahn renamed this task from "upgrade mwmaint1002 to buster" to "upgrade mwmaint servers to buster". Thu, Feb 18, 6:25 PM
Dzahn updated the task description.

renaming this ticket to cover both mwmaint* servers and not be just for eqiad alone

Change 665144 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: switch mwmaint2001 to use buster installer

https://gerrit.wikimedia.org/r/665144

Change 665144 merged by Dzahn:
[operations/puppet@production] install_server: switch mwmaint2001 to use buster installer

https://gerrit.wikimedia.org/r/665144

Mentioned in SAL (#wikimedia-operations) [2021-02-18T23:11:16Z] <mutante> mwmaint2001 - will be rebooted for OS upgrade - T267607

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mwmaint2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102182316_dzahn_17848_mwmaint2001_codfw_wmnet.log.

Change 665225 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap: remove mwmaint2001 from "dsh" groups

https://gerrit.wikimedia.org/r/665225

Change 665225 merged by Dzahn:
[operations/puppet@production] scap: remove mwmaint2001 from "dsh" groups

https://gerrit.wikimedia.org/r/665225

Completed auto-reimage of hosts:

['mwmaint2001.codfw.wmnet']

and were ALL successful.

IIRC the previous update for the mwmaint servers happened via a hardware replacement: mwmaint1002 was a new server which replaced terbium. Procedure-wise it's probably best if we reimage an existing mw* server in eqiad with the mediawiki::maintenance role and then fall back to mwmaint1002 once that has been reimaged? But that would require adding some logic in Hiera to flag whether a server running role::mediawiki::maintenance is the current active one (most of the tasks are triggered via the common profile::mediawiki::periodic_job); alternatively, Puppet is disabled manually and the systemd timers are stopped manually.
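To make the "active host" flag idea above a bit more concrete, here is a minimal sketch in Python (not Puppet, purely for illustration). Everything in it is an assumption: the `ACTIVE_MAINT_HOST` mapping is a hypothetical stand-in for a Hiera key, the host names and FQDNs are illustrative, and `is_active`, `timer_state` and `flip_active` are made-up helpers, not anything that exists in operations/puppet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintHost:
    """A server carrying the maintenance role (illustrative model only)."""
    fqdn: str
    datacenter: str


# Hypothetical stand-in for a Hiera key naming the active maintenance host
# per datacenter. The key, its shape and the FQDNs are assumptions.
ACTIVE_MAINT_HOST = {
    "eqiad": "mwmaint1002.eqiad.wmnet",
    "codfw": "mwmaint2001.codfw.wmnet",
}


def is_active(host: MaintHost, active_by_dc: dict = ACTIVE_MAINT_HOST) -> bool:
    """True if this host is flagged as the active one for its datacenter."""
    return active_by_dc.get(host.datacenter) == host.fqdn


def timer_state(host: MaintHost, active_by_dc: dict = ACTIVE_MAINT_HOST) -> str:
    """Periodic job timers run only on the active host."""
    return "enabled" if is_active(host, active_by_dc) else "disabled"


def flip_active(active_by_dc: dict, datacenter: str, new_fqdn: str) -> dict:
    """Return a copy of the mapping with a new active host for one datacenter."""
    updated = dict(active_by_dc)
    updated[datacenter] = new_fqdn
    return updated
```

With something like this, both hosts in a DC could carry the same role, and flipping would be a one-line data change instead of disabling Puppet and stopping timers by hand:

```python
standby = MaintHost("mwmaint1003.eqiad.wmnet", "eqiad")
flipped = flip_active(ACTIVE_MAINT_HOST, "eqiad", standby.fqdn)
print(timer_state(standby, flipped))
```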

T274170 introduced new hardware, mwmaint2002, which can be used now. timing :p