
Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays)
Closed, Resolved, Public

Description

Rather than upgrading these hosts in place, we will decommission them and set up replacement hosts that are upgraded (or installed with the upgraded OS from the start):

  • backup1001 (to be replaced by backup1009)
  • backup1002 (to be replaced by backup1013)
  • backup2001 (to be replaced by backup2009)
  • backup2002 (to be replaced by backup2013)

Pending actions:

  • For decom of backup*001: Migrate director to backup1009 (or backup1013/1014)
  • For decom of backup*002: Backup read only hosts (es1-es5)

Details

Related Changes in Gerrit:
Repo               Branch      Lines +/-
operations/puppet  production  +0 -5
operations/puppet  production  +0 -5
operations/puppet  production  +11 -154
operations/puppet  production  +13 -10
operations/puppet  production  +9 -0
operations/puppet  production  +6 -6
operations/puppet  production  +127 -111
operations/puppet  production  +27 -0
operations/puppet  production  +4 -1
operations/puppet  production  +1 -0
operations/puppet  production  +1 -1
operations/puppet  production  +18 -87
operations/puppet  production  +7 -0
operations/puppet  production  +26 -27
operations/puppet  production  +38 -1
operations/puppet  production  +37 -1

Event Timeline

jcrespo changed the task status from Open to In Progress. Mar 4 2025, 4:47 PM
jcrespo triaged this task as Medium priority.
jcrespo updated the task description.

Change #1124487 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Prepare backup1013 to take over eqiad backups of es* dbs

https://gerrit.wikimedia.org/r/1124487

Change #1124487 merged by Jcrespo:

[operations/puppet@production] dbbackups: Prepare backup1013 to take over eqiad backups of es* dbs

https://gerrit.wikimedia.org/r/1124487

Change #1124720 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Prepare backup2013 to take over codfw backups of es* dbs

https://gerrit.wikimedia.org/r/1124720

Change #1124720 merged by Jcrespo:

[operations/puppet@production] dbbackups: Prepare backup2013 to take over codfw backups of es* dbs

https://gerrit.wikimedia.org/r/1124720

Mentioned in SAL (#wikimedia-operations) [2025-03-05T09:18:53Z] <jynus> deploy new backup grants for es1036,es1040 T387892

Mentioned in SAL (#wikimedia-operations) [2025-03-05T09:23:07Z] <jynus> deploy new backup grants for es2036,es2040 T387892

Change #1124738 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Migrate es backups from backup[12]02 to backup[12]13

https://gerrit.wikimedia.org/r/1124738

Change #1124738 merged by Jcrespo:

[operations/puppet@production] dbbackups: Migrate es backups from backup[12]02 to backup[12]13

https://gerrit.wikimedia.org/r/1124738

Mentioned in SAL (#wikimedia-operations) [2025-03-05T15:40:35Z] <jynus> starting es backups on new hosts backup1013, backup2013 T387892

I made a mistake: I didn't add the new hosts to the production grants on m1 (used for backup state tracking). This is not fatal (metadata gathering is optional, so a failure there is not a hard backup failure), but it means we will have to fix the grants tomorrow and test again.
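The difference between a hard backup failure and optional metadata gathering can be illustrated with a short sketch. This is a hypothetical example, not the actual dbbackups code. The names `run_backup_with_optional_metadata`, `take_backup`, `record_metadata` and `BackupResult` are made up for illustration. The assumption is that the dump and the stats write to m1 are two separate steps and that only the first one decides success.

```python
import logging
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("dbbackups-sketch")


@dataclass
class BackupResult:
    ok: bool                 # did the backup (dump) itself succeed?
    metadata_recorded: bool  # was backup state tracking (e.g. on m1) updated?


def run_backup_with_optional_metadata(
    take_backup: Callable[[], None],
    record_metadata: Callable[[], None],
) -> BackupResult:
    """Hypothetical sketch: only a failed dump is a hard failure.

    Recording metadata is best effort. Any error there (for example a
    missing grant for the stats user) is logged as a warning, not raised.
    """
    try:
        take_backup()
    except Exception:
        log.exception("backup failed")
        return BackupResult(ok=False, metadata_recorded=False)

    try:
        record_metadata()
    except Exception as exc:  # best effort only: never fail the backup here
        log.warning("backup OK, but metadata not recorded: %s", exc)
        return BackupResult(ok=True, metadata_recorded=False)

    return BackupResult(ok=True, metadata_recorded=True)
```

With a metadata step that raises (as a missing grant would), the result is still `ok=True` with `metadata_recorded=False`. That matches the situation described above: the backups ran, but the state tracking needs fixing and re-testing.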

Change #1124834 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Add additional m1 grants for backup[12]013 stats user

https://gerrit.wikimedia.org/r/1124834

Mentioned in SAL (#wikimedia-operations) [2025-03-06T09:28:35Z] <jynus> deploy additional grants to m1 T387892

Change #1124834 merged by Jcrespo:

[operations/puppet@production] dbbackups: Add additional m1 grants for backup[12]013 stats user

https://gerrit.wikimedia.org/r/1124834

Change #1125114 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Prepare backup1002, backup2002 for decommissioning

https://gerrit.wikimedia.org/r/1125114

Mentioned in SAL (#wikimedia-operations) [2025-03-12T10:14:28Z] <jynus> removing backup1002, backup2002 dump user on es6,es7 T387892

Mentioned in SAL (#wikimedia-operations) [2025-03-12T10:42:21Z] <jynus> removing backup1002, backup2002 dbbackups user @ m1 T387892

Change #1125114 merged by Jcrespo:

[operations/puppet@production] dbbackups: Prepare backup1002, backup2002 for decommissioning

https://gerrit.wikimedia.org/r/1125114

Change #1160691 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Migrate backup1001's director role to backup1014

https://gerrit.wikimedia.org/r/1160691

Mentioned in SAL (#wikimedia-operations) [2025-06-18T10:14:49Z] <jynus> starting backup director migration backup1001 -> backup1014 T387892

Change #1160691 merged by Jcrespo:

[operations/puppet@production] bacula: Migrate backup1001's director role to backup1014

https://gerrit.wikimedia.org/r/1160691

Icinga downtime and Alertmanager silence (ID=6ac28898-4cc3-4a9a-b6a4-de2ec38f7f20) set by jynus@cumin1002 for 4:00:00 on 2 host(s) and their services with reason: Backup director migration

backup[1001,1014].eqiad.wmnet
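The host list above uses the bracket notation accepted by cumin, which follows ClusterShell node set syntax. Here `backup[1001,1014].eqiad.wmnet` means the two hosts backup1001.eqiad.wmnet and backup1014.eqiad.wmnet. Below is a minimal, hypothetical expander (`expand_hosts` is an illustrative name). It only handles a single bracket group with comma-separated values and zero-padded ranges. Real tooling such as ClusterShell's NodeSet covers the full syntax, including multiple groups and folding.

```python
import re

# One bracket group, e.g. "[1001,1014]" or "[1036-1038]".
_GROUP = re.compile(r"\[([^\]]+)\]")


def expand_hosts(pattern: str) -> list[str]:
    """Expand a single ClusterShell-style bracket group into host names.

    Minimal sketch: supports comma-separated items and numeric ranges
    (a-b), preserving the zero padding width of the range start.
    """
    match = _GROUP.search(pattern)
    if match is None:
        return [pattern]  # no brackets: a plain host name
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    hosts = []
    for item in match.group(1).split(","):
        if "-" in item:
            start, end = item.split("-")
            width = len(start)  # keep padding, e.g. 001-003
            for n in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{item}{suffix}")
    return hosts
```

The informal shorthand used elsewhere in this task, such as backup[12]013 for backup1013 and backup2013, is not this syntax. Both this sketch and node set tools would read it as the single host backup12013.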

Change #1160730 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Update wrong role for backup2009

https://gerrit.wikimedia.org/r/1160730

Change #1160730 merged by Jcrespo:

[operations/puppet@production] bacula: Update wrong role for backup2009

https://gerrit.wikimedia.org/r/1160730

Change #1160735 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Force puppet7 on backup1009

https://gerrit.wikimedia.org/r/1160735

Change #1160735 merged by Jcrespo:

[operations/puppet@production] bacula: Force puppet7 on backup1009

https://gerrit.wikimedia.org/r/1160735

Change #1160739 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Migrate and create general properties for the new/renamed roles

https://gerrit.wikimedia.org/r/1160739

Change #1160739 merged by Jcrespo:

[operations/puppet@production] bacula: Migrate and create general properties for the new/renamed roles

https://gerrit.wikimedia.org/r/1160739

Change #1160756 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Discourage the usage of backup1001 as director

https://gerrit.wikimedia.org/r/1160756

Change #1160756 merged by Jcrespo:

[operations/puppet@production] bacula: Discourage the usage of backup1001 as director

https://gerrit.wikimedia.org/r/1160756

The Bacula director has been migrated, with some bumps along the way.
Documentation has been updated.
The host's login banner was updated to say:

Debian GNU/Linux 11 (bullseye)

     _         _   _  ___ _____                    _   _     _
  __| | ___   | \ | |/ _ \_   _|  _   _ ___  ___  | |_| |__ (_)___
/  _` |/ _ \  |  \| | | | || |   | | | / __|/ _ \ | __| '_ \| / __|
| (_| | (_) | | |\  | |_| || |   | |_| \__ \  __/ | |_| | | | \__ \
 \__,_|\___/  |_| \_|\___/ |_|    \__,_|___/\___|  \__|_| |_|_|___/

                              _
 ___  ___ _ ____   _____ _ __| |
/ __|/ _ \ '__\ \ / / _ \ '__| |
\__ \  __/ |   \ V /  __/ |  |_|
|___/\___|_|    \_/ \___|_|  (_)

backup1001 is no longer the active bacula director for WMF files.

Use backup1014 instead !!!

Mentioned in SAL (#wikimedia-operations) [2025-06-18T13:00:40Z] <jynus> bacula director migration finalized, backup1014 is the new bacula director. backup1001 should no longer be used. T387892

backup1001 and backup2001 are also closer to decommissioning. First I need to review their storage contents to make sure nothing was left unmigrated (at least the archive storage is still pending). After that, we will be able to run the decommission script.
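The "make sure nothing was left unmigrated" review can be sketched as a comparison of two file listings. This assumes both the old and the new storage are reachable as directory trees, and that matching relative path plus size is a useful first check. Both assumptions are hypothetical: a real review of Bacula storage would more likely go through the catalog (jobs and volumes). The function names are illustrative only.

```python
import os


def list_files(root: str) -> dict[str, int]:
    """Map each file's path (relative to root) to its size in bytes."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            listing[os.path.relpath(full, root)] = os.path.getsize(full)
    return listing


def find_unmigrated(old: dict[str, int], new: dict[str, int]) -> list[str]:
    """Return paths present on the old storage but missing from the new
    storage, or present there with a different size. Sorted for stable output.
    """
    return sorted(path for path, size in old.items() if new.get(path) != size)
```

For example, `find_unmigrated(list_files(old_root), list_files(new_root))` returns an empty list only when every old file has a same-sized counterpart. The variables `old_root` and `new_root` are placeholders for the two storage mount points.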

Regarding the *002 hosts, I am still waiting for the DBAs to confirm the refresh of the es read-only backups.

Mentioned in SAL (#wikimedia-operations) [2025-06-19T07:37:45Z] <jynus> just started es read only backup regeneration T387892

Mentioned in SAL (#wikimedia-operations) [2025-06-23T15:47:13Z] <jynus> drop backup users from es1-es5 hosts T387892

Change #1163645 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Enable temporarily read only backups for refresh

https://gerrit.wikimedia.org/r/1163645

Change #1163645 merged by Jcrespo:

[operations/puppet@production] dbbackups: Enable temporarily read only backups for refresh

https://gerrit.wikimedia.org/r/1163645

Change #1163694 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Disable read only backups and reenable regular rw es backups

https://gerrit.wikimedia.org/r/1163694

Change #1164158 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Create a temporary backup job for long term Archival

https://gerrit.wikimedia.org/r/1164158

Change #1164158 merged by Jcrespo:

[operations/puppet@production] bacula: Create a temporary backup job for long term Archival

https://gerrit.wikimedia.org/r/1164158

Change #1163694 merged by Jcrespo:

[operations/puppet@production] dbbackups: Disable read only backups and reenable regular rw es backups

https://gerrit.wikimedia.org/r/1163694

Backups of the read-only es sections are almost complete (except es5 on backup2013), so we are almost ready to decommission backup[12]002.

Migration of archival files from backup1001 is also ongoing and should finish today.

This should unblock the decommissioning of all hosts next week.
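The read-only backup status above works as a gate: the *002 hosts can only go once every read-only section has a fresh copy on both new hosts. A hypothetical sketch of that check follows. The section list (es1-es5) and the hosts (backup1013, backup2013) come from this task. The idea that each section needs a copy on each host, and the function names, are illustrative assumptions.

```python
from itertools import product

# Read-only es sections and the new hosts expected to hold a refreshed copy
# (assumption based on this task: es1-es5, backup1013 eqiad, backup2013 codfw).
RO_SECTIONS = [f"es{n}" for n in range(1, 6)]
NEW_HOSTS = ["backup1013", "backup2013"]


def pending_backups(completed: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (section, host) pairs that still lack a completed backup."""
    return [pair for pair in product(RO_SECTIONS, NEW_HOSTS) if pair not in completed]


def ready_to_decommission(completed: set[tuple[str, str]]) -> bool:
    """The old hosts can be removed only when nothing is pending."""
    return not pending_backups(completed)
```

Take the status described above, where everything is done except es5 on backup2013. In that case `pending_backups` returns `[("es5", "backup2013")]` and `ready_to_decommission` is `False`.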

Change #1164981 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01

https://gerrit.wikimedia.org/r/1164981

Change #1164981 merged by Jcrespo:

[operations/puppet@production] bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01

https://gerrit.wikimedia.org/r/1164981

Change #1165822 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Remove backup1001 old backup director host from puppet

https://gerrit.wikimedia.org/r/1165822

Change #1165823 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Remove backup2001, old offsite backup host

https://gerrit.wikimedia.org/r/1165823

Change #1165823 merged by Jcrespo:

[operations/puppet@production] bacula: Remove backup2001, old offsite backup host

https://gerrit.wikimedia.org/r/1165823

Change #1165822 merged by Jcrespo:

[operations/puppet@production] bacula: Remove backup1001 old backup director host from puppet

https://gerrit.wikimedia.org/r/1165822

This is now finished, with only some operations left for dc-ops to do on the child tasks. From a backup perspective, the migration is complete.