Service implementation for wdqs101[4,5,6]
Closed, Resolved (Public)

Description

Creating this ticket to track the work required to bring wdqs hosts wdqs101[4,5,6] into service. These hosts replace wdqs100[3-5]; as such, one of them (wdqs1016) will be in wdqs-internal to match wdqs1003's old role.

AC

  • wdqs101[4,5] in service in wdqs public
  • wdqs1016 in service in wdqs-internal
    • Change wdqs1016's role from public to internal (site.pp, conftool)
    • Reimage wdqs1016
    • data-transfer to 1016
    • (technically not this ticket) decom wdqs100[3-4] (1005 is already done)

Event Timeline

Change 821785 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: bring more hosts online

https://gerrit.wikimedia.org/r/821785

Change 821785 merged by Bking:

[operations/puppet@production] wdqs: bring more hosts online

https://gerrit.wikimedia.org/r/821785

Mentioned in SAL (#wikimedia-operations) [2022-08-09T19:55:27Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890

Mentioned in SAL (#wikimedia-operations) [2022-08-09T19:55:51Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890

Mentioned in SAL (#wikimedia-operations) [2022-08-09T19:56:26Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890

Mentioned in SAL (#wikimedia-operations) [2022-08-09T19:56:40Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890

Mentioned in SAL (#wikimedia-operations) [2022-08-09T19:57:06Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890

Mentioned in SAL (#wikimedia-operations) [2022-08-09T19:57:12Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890

On new hosts only, the data transfer cookbook fails on the repooling step. I believe this is because the newly-provisioned server is not yet enabled in the load balancer pool.

To enable it, run the following command from a cumin host (shown here for wdqs1014):

confctl select name=wdqs1014.eqiad.wmnet set/weight=10:pooled=yes
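
When several new hosts need pooling, the same command line can be generated per host instead of typed by hand. The following is a minimal sketch, not part of any cookbook: the helper name `confctl_pool_command` is hypothetical, the default weight of 10 is simply taken from the command above, and the script only prints the commands so they can be reviewed before being run on a cumin host.

```python
import shlex


def confctl_pool_command(host: str, weight: int = 10) -> str:
    """Build the confctl command line that pools `host` with the given weight.

    Hypothetical helper: it only formats the same command shown in this
    ticket and does not run confctl itself.
    """
    selector = f"name={host}"
    action = f"set/weight={weight}:pooled=yes"
    # shlex.quote leaves these arguments untouched unless they contain
    # characters the shell would interpret.
    return " ".join(["confctl", "select", shlex.quote(selector), shlex.quote(action)])


# The two hosts that stay in the public wdqs pool according to the AC.
for host in ("wdqs1014.eqiad.wmnet", "wdqs1015.eqiad.wmnet"):
    print(confctl_pool_command(host))
```

Printing rather than executing keeps the pooling step a deliberate manual action, which seems appropriate for a change that affects live traffic.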

Mentioned in SAL (#wikimedia-operations) [2022-08-22T13:37:47Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on wdqs[1014-1016].eqiad.wmnet with reason: T314890

Mentioned in SAL (#wikimedia-operations) [2022-08-22T13:38:02Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wdqs[1014-1016].eqiad.wmnet with reason: T314890

This should be finished, closing...

We made a slight mistake: one of these hosts needed to be in wdqs-internal, since wdqs1003 (one of the hosts they replace) was in wdqs-internal.

Re-opening this and updating the task description's AC accordingly.

Steps left to do:

  • Change wdqs1016's role from public to internal (site.pp, conftool)
  • Reimage wdqs1016
  • data-transfer to 1016
  • (technically not this ticket) decom wdqs100[3-4] (1005 is already done)

Oops, meant this to be in progress. Changing status to Open from previous state of Resolved.

Change 955396 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs-internal: switch wdqs1016 from public to internal role

https://gerrit.wikimedia.org/r/955396

Host was reimaged but patch wasn't yet merged. Merging patch and rolling the re-image again.

Change 955396 merged by Ryan Kemper:

[operations/puppet@production] wdqs-internal: switch wdqs1016 from public to internal role

https://gerrit.wikimedia.org/r/955396

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed:

  • wdqs1016 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"wdqs1016.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=wdqs-internal,service=wdqs"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309080457_ryankemper_1532615_wdqs1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • No changes in confctl are needed to restore the previous state.
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
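
The confctl state quoted in the reimage report above is a single JSON object, so it can be checked programmatically, for example to confirm that wdqs1016 now sits in the wdqs-internal cluster and is still depooled. This is a sketch only: the function name `parse_confctl_object` is hypothetical, and it assumes one host entry per line plus a comma-separated `tags` string, as in the line copied from the report.

```python
import json

# Line copied verbatim from the reimage report in this ticket.
REPORT_LINE = (
    '{"wdqs1016.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, '
    '"tags": "dc=eqiad,cluster=wdqs-internal,service=wdqs"}'
)


def parse_confctl_object(line: str) -> dict:
    """Split one confctl JSON object into host, weight, pooled state and tags.

    Hypothetical helper; assumes exactly one host key besides "tags".
    """
    obj = json.loads(line)
    # "tags" is a comma-separated list of key=value pairs.
    tags = dict(pair.split("=", 1) for pair in obj.pop("tags").split(","))
    # After removing "tags", the remaining key is the host name.
    host, state = next(iter(obj.items()))
    return {
        "host": host,
        "weight": state["weight"],
        "pooled": state["pooled"],
        "tags": tags,
    }


info = parse_confctl_object(REPORT_LINE)
print(info["host"], info["pooled"], info["tags"]["cluster"])
# prints: wdqs1016.eqiad.wmnet inactive wdqs-internal
```
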
RKemper moved this task from In Progress to Done on the Data-Platform-SRE board.