
Replace production deployment servers and update them to Buster
Closed, Resolved (Public)

Description

deploy1001 and deploy2001 are going to be replaced with new hardware as part of the regular hardware refresh, due to their age.

  • deploy1002 setup T265653
  • deploy2002 setup T264633

Since we are migrating all servers to Buster, it makes sense to directly install Buster.

  • install buster on deploy1002
  • install buster on deploy2002
  • apply deployment server role on deploy1002
  • temp remove deploy1002 from scap dsh groups to avoid errors for deployers
  • fix: E: Package 'mysql-client' has no installation candidate (/Profile::Mediawiki::Deployment::Server/File[/usr/local/sbin/fix-staging-perms]: Dependency Package[mysql-client] has failures: true)
  • fix scap bootstrap issue / run scap deploy --init (worked around with rsync/puppet, scap-sync-master and scap pull)

(/usr/bin/scap deploy --init fails when running on non-active servers)

  • apply deployment server role on deploy2002
  • re-add both servers to scap DSH groups
  • sync repo data over from old servers to new servers
  • schedule and announce switchover date
  • actually switch the active server in puppet, then check that the other servers show the "warning MOTD" telling devs which server is the right one and which is not

Details

Project                  Branch      Lines +/-  Subject
operations/puppet        production  +2 -5
operations/homer/public  master      +8 -0
operations/puppet        production  +8 -7
operations/dns           master      +2 -2
operations/puppet        production  +6 -0
operations/puppet        production  +2 -1
operations/puppet        production  +3 -3
operations/dns           master      +0 -4
operations/puppet        production  +2 -0
operations/puppet        production  +4 -0
labs/private             master      +0 -0
operations/puppet        production  +1 -1
operations/puppet        production  +1 -6
operations/puppet        production  +0 -1
operations/puppet        production  +0 -2
operations/puppet        production  +1 -0
operations/puppet        production  +1 -5
labs/private             master      +0 -0
operations/puppet        production  +3 -0
operations/puppet        production  +1 -1

Related Objects

Status    Subtype  Assigned         Task
Stalled            tstarling
Stalled            None
Stalled            None
Open               None
Stalled            None
Stalled            None
Stalled            None
Stalled            None
Open               None
Open               None
Resolved           Jdforrester-WMF
Open               None
Open               None
Resolved           Dzahn
Resolved           Papaul
Resolved           Cmjohnson
Open               dancy
Open               None
Resolved  Request  Cmjohnson
Resolved  Request  Papaul

Event Timeline


Change 635404 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add deployment_server role on deploy1002

https://gerrit.wikimedia.org/r/635404

Mentioned in SAL (#wikimedia-operations) [2020-10-22T18:34:30Z] <mutante> adding mcrouter cert for deploy1002.eqiad.wmnet T265963

Change 635879 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[labs/private@master] add fake mcrouter certs for deploy1002.eqiad.wmnet

https://gerrit.wikimedia.org/r/635879

Change 635879 merged by Dzahn:
[labs/private@master] add fake mcrouter certs for deploy1002.eqiad.wmnet

https://gerrit.wikimedia.org/r/635879

Change 635404 merged by Dzahn:
[operations/puppet@production] site: add deployment_server role on deploy1002

https://gerrit.wikimedia.org/r/635404

Change 635109 merged by Dzahn:
[operations/puppet@production] scap/dsh: add deploy1002 to mediawiki_installation hosts

https://gerrit.wikimedia.org/r/635109

Mentioned in SAL (#wikimedia-operations) [2020-10-22T21:56:09Z] <mutante> deploy1002 - scap pull and added to mediawiki-installation "dsh" group - will be part of scap trains but just like any appserver (T265963)

Mentioned in SAL (#wikimedia-operations) [2020-10-22T22:03:12Z] <mutante> deploy1002 - armed keyholder, all deployment keys loaded T265963

Dzahn mentioned this in Unknown Object (Task). Nov 3 2020, 2:53 AM
Papaul closed subtask Unknown Object (Task) as Resolved. Nov 9 2020, 4:06 PM

Let's directly install deploy1002 with Buster? This allows us to adapt the Puppet manifests/packages for Buster and run tests, and then in January the production deployment server can simply be switched over?

jijiki added a subscriber: jijiki.

Let's directly install deploy1002 with Buster? This allows us to adapt the Puppet manifests/packages for Buster and run tests, and then in January the production deployment server can simply be switched over?

Yes! I have updated the description to reflect that. If we install Buster now, we are sparing ourselves the drama of upgrading later.

jijiki renamed this task from "replace production deployment servers" to "Replace production deployment servers and update them to Buster". Nov 26 2020, 3:14 PM

Change 644319 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] switch deploy1002/deploy2002 to use buster installer

https://gerrit.wikimedia.org/r/644319

Change 644319 merged by Dzahn:
[operations/puppet@production] switch deploy1002/deploy2002 to use buster installer

https://gerrit.wikimedia.org/r/644319

Change 644320 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap: temp remove deploy1002 from dsh group (don't deploy mw to it)

https://gerrit.wikimedia.org/r/644320

Change 644320 merged by Dzahn:
[operations/puppet@production] scap: temp remove deploy1002 from dsh group (don't deploy mw to it)

https://gerrit.wikimedia.org/r/644320

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

deploy1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011302027_dzahn_8991_deploy1002_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2020-11-30T20:28:51Z] <mutante> reimaging deploy1002 with buster - not the active deployment server, deploy1001 still is (T265963)

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

deploy2002.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011302042_dzahn_12260_deploy2002_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2020-11-30T20:42:53Z] <mutante> reimaging deploy2002 with buster (not active, deploy1001/2001 are) T265963

Completed auto-reimage of hosts:

['deploy2002.codfw.wmnet']

and were ALL successful.

Change 644333 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add deploy2002 and unify deployment server role regex

https://gerrit.wikimedia.org/r/644333

Dzahn updated the task description. (Show Details)

Change 644350 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] deployment::server: buster support, use mariadb-client, not mysql-client

https://gerrit.wikimedia.org/r/644350

Completed auto-reimage of hosts:

['deploy1002.eqiad.wmnet']

Of which those FAILED:

['deploy1002.eqiad.wmnet']

Change 644350 merged by Dzahn:
[operations/puppet@production] deployment::server: buster support, use default-mysql-client package

https://gerrit.wikimedia.org/r/644350

@jijiki The issue with "mysql-client" not existing on buster has been fixed by using default-mysql-client. This was a no-op on the stretch deployment servers, and it removed a puppet error and further dependencies for httpbb tests on the new buster servers.

The next issue is this:

Error: Execution of '/usr/bin/scap deploy --init' returned 70: 20:10:30 deploy failed: <LockFailedError> Failed to acquire lock "/var/lock/scap-global-lock"; owner is "root"; reason is "Not the active deployment server, use deploy1001.eqiad.wmnet"

Puppet tries to run "scap deploy --init" (many times actually, once for each repo), but they all fail because scap is locked when it is not on the active deployment server.

So either this just needs to wait until we have an actual maintenance window where we switch and then someone does the scap init dance (I think releng did this last time), or we need to allow running init on non-active servers while making sure we never interfere with actual deployments, of course. The horror scenario would be that it syncs to appservers before having the current code from the previous server.

Change 644333 merged by Dzahn:
[operations/puppet@production] site: add deploy2002 and unify deployment server role regex

https://gerrit.wikimedia.org/r/644333

As @hashar pointed out, there is also T257317, which describes exactly this issue, and I can simply quote Jaime: "scap syncronization (all methods) should be disabled because of the lock, but probably --init should be allowed".

and T257319
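
The policy quoted above (block every sync method on non-active servers, but let --init through) could be sketched roughly as follows. This is a hypothetical Python illustration, not scap's actual code: the function name check_scap_lock, the LockFailedError class defined here, and the SYNC_OPERATIONS set are all assumptions made for the example.

```python
class LockFailedError(Exception):
    """Raised when a command is refused on a non-active deployment server."""


# Hypothetical set of operations that push code out to other hosts.
SYNC_OPERATIONS = {"sync", "sync-file", "deploy"}


def check_scap_lock(operation, init_only, this_host, active_host):
    """Refuse sync operations on non-active hosts, but allow 'deploy --init'.

    Returns None when the command may proceed and raises LockFailedError
    when it must not run on this host.
    """
    if this_host == active_host:
        # The active deployment server is never restricted by this guard.
        return
    if operation == "deploy" and init_only:
        # Initialising a repo locally does not push anything to appservers.
        return
    if operation in SYNC_OPERATIONS:
        raise LockFailedError(
            f"Not the active deployment server, use {active_host}"
        )
```

With a guard like this, puppet's initial "deploy --init" run on a freshly installed standby would pass, while a real deploy or sync from the same host would still be rejected with a message naming the active server.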

Change 644616 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[labs/private@master] add fake mcrouter certs for deploy2002

https://gerrit.wikimedia.org/r/644616

Change 644616 merged by Dzahn:
[labs/private@master] add fake mcrouter certs for deploy2002

https://gerrit.wikimedia.org/r/644616

Icinga downtime for 40 days, 0:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install

deploy2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2020-12-01T21:18:51Z] <mutante> applied deployment_server role on deploy2002, added mcrouter cert, initial puppet run pulls mediawiki-config and other repos, downtimed in Icinga for 40 days (T265963)

Mentioned in SAL (#wikimedia-operations) [2020-12-22T21:31:15Z] <mutante> deploy1002/deploy2002 - apt-get remove --purge php-readline and let puppet reinstall it (7.2 vs 7.3 after gerrit 651158) T265963

php-readline 2:7.2+69+0~20190215163918.14+stretch~1.gbpfa617b+wmf1 is installed on the new servers now after deploying https://gerrit.wikimedia.org/r/651158 and letting puppet reinstall it

Change 635079 merged by Dzahn:
[operations/puppet@production] add deploy1002 and deploy2002 to deployment_hosts for firewalls

https://gerrit.wikimedia.org/r/635079

Change 658643 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap: add deploy2002 to mediawiki installation hosts

https://gerrit.wikimedia.org/r/658643

Change 658643 merged by Dzahn:
[operations/puppet@production] scap: add deploy1002 and deploy2002 to mediawiki hosts

https://gerrit.wikimedia.org/r/658643

Mentioned in SAL (#wikimedia-operations) [2021-02-25T23:55:13Z] <mutante> deploy1002, deploy2002 - scap-master-sync deploy1001.eqiad.wmnet (T265963)

Change 635114 abandoned by Dzahn:
[operations/dns@master] remove deploy1001.eqiad.wmnet

Reason:
DNS is now automated

https://gerrit.wikimedia.org/r/635114

Dzahn raised the priority of this task from Medium to High. Feb 26 2021, 12:12 AM

Change 667043 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap: switch codfw deployment server and scap master to deploy2002

https://gerrit.wikimedia.org/r/667043

Mentioned in SAL (#wikimedia-operations) [2021-02-26T20:29:48Z] <mutante> deploy2001 - /srv/mediawiki-staging sudo find . -name *.cdb delete - deleted 190 GB of old cdb files (T275826 T265963)
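
For a cleanup like the one logged above, it can help to measure what would be removed before deleting anything. The following is a minimal Python sketch of that idea, not the command that was actually run: the helper names find_cdb_files and delete_files are hypothetical, and the dry-run default is an assumption added for safety.

```python
from pathlib import Path


def find_cdb_files(root):
    """Return (paths, total_bytes) for all *.cdb files under root."""
    paths = sorted(p for p in Path(root).rglob("*.cdb") if p.is_file())
    total = sum(p.stat().st_size for p in paths)
    return paths, total


def delete_files(paths, dry_run=True):
    """Delete the given files unless dry_run is set; return the count handled."""
    for p in paths:
        if not dry_run:
            p.unlink()
    return len(paths)
```

Calling find_cdb_files on the staging directory first gives a file count and a byte total to sanity-check, and delete_files only removes anything when dry_run=False is passed explicitly.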

Change 667043 merged by Dzahn:
[operations/puppet@production] scap: switch codfw deployment server and scap master to deploy2002

https://gerrit.wikimedia.org/r/667043

Change 667277 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] deployment_server: add the php restart commands here as well

https://gerrit.wikimedia.org/r/667277

Change 667277 abandoned by Dzahn:
[operations/puppet@production] deployment_server: add the php restart commands here as well

Reason:
requires a lot of other things from mediawiki::php

https://gerrit.wikimedia.org/r/667277

Change 667278 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] deployment: allow syncing home dirs to other dpeloyment servers

https://gerrit.wikimedia.org/r/667278

Change 667278 merged by Dzahn:
[operations/puppet@production] deployment: allow syncing home dirs to other dpeloyment servers

https://gerrit.wikimedia.org/r/667278

Switching eqiad is scheduled for Monday, March 1st:

https://wikitech.wikimedia.org/wiki/Deployments#Monday,_March_01

announced on ops and wikitech-l

Change 635113 merged by Dzahn:
[operations/dns@master] switch deployment CNAME from deploy1001 to deploy1002

https://gerrit.wikimedia.org/r/635113

Change 635105 merged by Dzahn:
[operations/puppet@production] hiera/scap: switch deployment server to deploy1002

https://gerrit.wikimedia.org/r/635105

Mentioned in SAL (#wikimedia-operations) [2021-03-01T21:05:19Z] <mutante> re-enabling puppet on deploy1001 - running puppet on deploy*, switching eqiad scap master and deployment_server globally (T265963)

Mentioned in SAL (#wikimedia-operations) [2021-03-01T21:08:55Z] <mutante> [mwdebug1001:~] $ /usr/local/lib/nagios/plugins/check_mw_versions --deployhost deploy1002.eqiad.wmnet - OKAY: wikiversions in sync (T265963)

Mentioned in SAL (#wikimedia-operations) [2021-03-01T21:38:16Z] <mutante> cumin 'mw*' 'grep master_rsync /etc/scap.cfg' showed all mw servers are now using deploy1002 (T265963)
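
The grep above confirms the setting by eye. A per-host check could also parse the file, as in this minimal Python sketch. It assumes /etc/scap.cfg is INI-style and readable by configparser; the function name master_rsync_values and the sample content are hypothetical and not taken from a real host.

```python
import configparser


def master_rsync_values(cfg_text):
    """Return {section: master_rsync} for every section that sets it."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        section: parser[section]["master_rsync"]
        for section in parser.sections()
        if "master_rsync" in parser[section]
    }


# Made-up sample content, only to show the expected shape of the input.
SAMPLE_CFG = """
[global]
master_rsync: deploy1002.eqiad.wmnet

[wmnet]
deploy_dir: /srv/mediawiki
"""
```

Comparing each returned value against the expected deployment server makes it easy to flag any section that still points at the old host.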

Change 667718 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/homer/public@master] add deploy1002/deploy2002 to scap firewall

https://gerrit.wikimedia.org/r/667718

Change 667718 merged by Dzahn:
[operations/homer/public@master] add deploy1002/deploy2002 to scap firewall

https://gerrit.wikimedia.org/r/667718

Change 668785 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mariadb: update grants for deployment servers to clouddb and prod-m5

https://gerrit.wikimedia.org/r/668785

@Andrew Could you please deploy a change for clouddb grants? see comment at the bottom of https://gerrit.wikimedia.org/r/c/operations/puppet/+/668785 Thanks!

Change 668785 abandoned by Dzahn:
[operations/puppet@production] mariadb: update grants for deployment servers to clouddb and prod-m5

Reason:
comments above

https://gerrit.wikimedia.org/r/668785

Change 668785 restored by Dzahn:
[operations/puppet@production] mariadb: update grants for deployment servers to clouddb and prod-m5

https://gerrit.wikimedia.org/r/668785

Change 668785 merged by Kormat:
[operations/puppet@production] mariadb: update grants for deployment servers to prod-m5, drop from clouddb

https://gerrit.wikimedia.org/r/668785