Migrate search-loader hosts to Bullseye or later
Closed, Resolved (Public)

Description

The following hosts are still on Buster and need to be migrated ASAP:

search-loader1001.eqiad.wmnet
search-loader2001.codfw.wmnet

Creating this ticket to start the process.

Event Timeline

@EBernhardson @dcausse as far as replacing these hosts:

  • Is there a way to test these hosts ahead of time? I didn't see apifeatureusage or search-loader hosts in deployment-prep.
  • What is the recommended procedure? I was just going to follow a typical one-host-at-a-time replacement. If this needs to happen during a maintenance window, and/or if there are additional steps after reimaging (such as restoring data), please let us know.

apifeatureusage are logstash hosts, so it might be better to ask the o11y team for advice here. Regarding the search-loader hosts, @EBernhardson might know best, but since they should run asynchronous processes and are stateless, I suspect the upgrade should be straightforward and not too risky.

Indeed, the search-loader changes should be straightforward. All their state is stored in Kafka; if they get restarted, they pick back up after the last confirmed message.
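To illustrate why a stateless consumer can be replaced safely, here is a minimal sketch of the resume-after-last-confirmed-message behaviour. It uses an in-memory stand-in rather than a real Kafka client; `FakeBroker`, `run_worker`, and the `crash_after` parameter are hypothetical names made up for this example and are not part of mjolnir or any Kafka library.

```python
from dataclasses import dataclass


@dataclass
class FakeBroker:
    """In-memory stand-in for one topic partition plus its committed offset."""
    messages: list
    committed: int = 0  # offset of the next message to deliver


def run_worker(broker, handle, crash_after=None):
    """Consume from the committed offset, confirming each message after handling it."""
    processed = 0
    while broker.committed < len(broker.messages):
        if crash_after is not None and processed == crash_after:
            return  # simulate the host being shut down mid-stream
        handle(broker.messages[broker.committed])
        broker.committed += 1  # confirm only once the work is done
        processed += 1


broker = FakeBroker(messages=["a", "b", "c", "d", "e"])
seen = []
run_worker(broker, seen.append, crash_after=2)  # old host stops after two messages
run_worker(broker, seen.append)                 # replacement host picks up the rest
print(seen)
```

Because the offset lives on the broker side and not on the worker, the second call needs no data restored from the first host, which is the property the comment above relies on.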

bking renamed this task from Migrate apifeatureusage/search-loader hosts to Bullseye or later to Migrate search-loader hosts to Bullseye or later. Sep 11 2023, 3:21 PM
bking updated the task description.

JFTR, both of these roles are Ganeti VMs, so another option would be to create apifeatureusage[12]002/search-loader[12]002, test and migrate these in parallel to the current production instances, and eventually decom the old [12]001 VMs.

Phab can only do parent/child task relationships, but I wanted to call out the following:

  • T346053: discussing apifeatureusage upgrades. @Gehel suggested separate tickets for apifeatureusage and search-loader.
  • T346189, where we're going to discuss migrating the search-loader service to Kubernetes.

Change 957336 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] site.pp: add new search-loader hostnames

https://gerrit.wikimedia.org/r/957336

Change 957336 merged by Bking:

[operations/puppet@production] site.pp: add new search-loader hostnames

https://gerrit.wikimedia.org/r/957336

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host search-loader2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host search-loader1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host search-loader2002.codfw.wmnet with OS bullseye completed:

  • search-loader2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309141523_bking_73363_search-loader2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 957762 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] search-loader: Move new VMs into prod role

https://gerrit.wikimedia.org/r/957762

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host search-loader1002.eqiad.wmnet with OS bullseye completed:

  • search-loader1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309141531_bking_81910_search-loader1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 957762 merged by Bking:

[operations/puppet@production] search-loader: Move new VMs into prod role

https://gerrit.wikimedia.org/r/957762

Mentioned in SAL (#wikimedia-operations) [2023-09-14T17:05:09Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on search-loader2002.codfw.wmnet,search-loader1002.eqiad.wmnet with reason: T346039

Mentioned in SAL (#wikimedia-operations) [2023-09-14T17:05:23Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on search-loader2002.codfw.wmnet,search-loader1002.eqiad.wmnet with reason: T346039

Looks like we still need to decommission the old VMs

Found that task when investigating something else.

search-loader2002: The last Puppet run was at Tue Sep 26 06:58:45 UTC 2023 (24831 minutes ago).

It's not really recommended to have a host out of Puppet for that long; it seems it even got evicted from PuppetDB because of inactivity. What's the ETA to have it pick up new Puppet changes? Ideally through a re-image.
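For a sense of scale, the figure quoted in the status line can be converted into days and an approximate observation time. This is a quick arithmetic sketch using only the two values given in the status line; the resulting timestamp is approximate because the minute count is presumably rounded.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the status line quoted in the comment.
last_run = datetime(2023, 9, 26, 6, 58, 45, tzinfo=timezone.utc)
minutes_ago = 24831

# Break the age into days, hours, and minutes.
days, remainder = divmod(minutes_ago, 24 * 60)
hours, minutes = divmod(remainder, 60)
print(f"{days} days, {hours} hours, {minutes} minutes")

# Approximate time at which the status line was generated.
observed_at = last_run + timedelta(minutes=minutes_ago)
print(observed_at.isoformat())
```

That works out to a little over 17 days without a Puppet run.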

@ayounsi sorry for the trouble; this is not a production host. I will move back to insetup and run puppet.

Change 965748 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] search-loader: Move bullseye hosts back to insetup

https://gerrit.wikimedia.org/r/965748

Change 965748 merged by Bking:

[operations/puppet@production] search-loader: Move bullseye hosts back to insetup

https://gerrit.wikimedia.org/r/965748

Change 969386 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] search-loader: use default system python

https://gerrit.wikimedia.org/r/969386

Change 969386 merged by Bking:

[operations/puppet@production] search-loader: use default system python

https://gerrit.wikimedia.org/r/969386

Change 969754 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] search-loader: Bring new hosts into service

https://gerrit.wikimedia.org/r/969754

Change 969754 abandoned by Bking:

[operations/puppet@production] search-loader: Bring new hosts into service

Reason:

https://gerrit.wikimedia.org/r/969754

Change 969761 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] search-loader: Bring new hosts into service

https://gerrit.wikimedia.org/r/969761

Change 969761 merged by Bking:

[operations/puppet@production] search-loader: Bring new hosts into service

https://gerrit.wikimedia.org/r/969761

Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:36:52Z] <inflatador> bking@search-loader2001 disabling services as part of bullseye migration T346039

Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:37:52Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader2001.codfw.wmnet with reason: T346039

Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:38:17Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader2001.codfw.wmnet with reason: T346039

Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:41:13Z] <bking@deploy2002> Started deploy [search/mjolnir/deploy@daf8c32]: T346039

Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:41:29Z] <bking@deploy2002> Finished deploy [search/mjolnir/deploy@daf8c32]: T346039 (duration: 00m 05s)

Mentioned in SAL (#wikimedia-operations) [2023-10-30T18:16:06Z] <bking@deploy2002> Started deploy [search/mjolnir/deploy@daf8c32]: T346039

Mentioned in SAL (#wikimedia-operations) [2023-10-30T18:16:24Z] <bking@deploy2002> Finished deploy [search/mjolnir/deploy@daf8c32]: T346039 (duration: 00m 06s)

Mentioned in SAL (#wikimedia-operations) [2023-10-30T18:19:45Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader[2001-2002].codfw.wmnet with reason: T346039

Mentioned in SAL (#wikimedia-operations) [2023-10-30T18:20:10Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader[2001-2002].codfw.wmnet with reason: T346039

Current state of operations:

I've suppressed alerts for all 4 search-loader hosts for the next day. I have deliberately NOT run puppet on search-loader1001 just in case we have to roll back.

Puppet is failing on the Bullseye hosts search-loader[12]002 because it is looking for a package named libpython, whereas the package is actually called libpython-${python_minor_vers}.
The package is automatically installed as a dependency when the generic python3 package is installed, so we do not need to specify it in the puppet code. I'll work on a patch now.
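The reasoning here (an explicitly declared package is unnecessary if another declared package already pulls it in) can be sketched as a small reachability check. The dependency map below is hypothetical and only mirrors the claim made in this comment; the package names are placeholders, not real Debian metadata, and `transitive_deps` and `redundant` are names invented for this example.

```python
def transitive_deps(package, depends):
    """Return every package reachable from `package` through `depends`."""
    seen = set()
    stack = [package]
    while stack:
        for dep in depends.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def redundant(declared, depends):
    """Declared packages that some other declared package already pulls in."""
    return {
        pkg
        for pkg in declared
        if any(
            pkg in transitive_deps(other, depends)
            for other in declared
            if other != pkg
        )
    }


# Hypothetical dependency map (assumption for illustration only).
depends = {
    "python3": ["python3-minimal", "libpython-versioned"],
    "python3-minimal": [],
}
declared = ["python3", "libpython-versioned"]
print(redundant(declared, depends))
```

Under that assumed map, only the versioned library shows up as redundant, so dropping it from the explicit list leaves the installed set unchanged.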

Change 969947 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] search-loader: removed unneeded package dep

https://gerrit.wikimedia.org/r/969947

Change 969947 merged by Bking:

[operations/puppet@production] search-loader: removed unneeded package dep

https://gerrit.wikimedia.org/r/969947

Change 969968 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] kafka-jumbo: permit traffic from new search-loader VMs

https://gerrit.wikimedia.org/r/969968

Change 969968 merged by Bking:

[operations/puppet@production] kafka-jumbo: permit traffic from new search-loader VMs

https://gerrit.wikimedia.org/r/969968

After the firewall patch directly above this comment, it appears that the connection errors for search-loader1002 and search-loader2002 have disappeared.

As such, I have started Puppet and all services on all search-loader VMs, and I've also removed all downtimes. I'm moving this ticket to "Needs Review" so @EBernhardson and/or the Search Platform team can confirm that everything is working.

Gehel triaged this task as Medium priority. Nov 3 2023, 10:29 AM

Per conversation with @Gehel, we might need to do a scream test by shutting off the old instances. I will look into this next week.

The relevant graphs seem reasonable; it looks like work has transferred over to the new instances. The msearch daemon only has work to perform on Fridays, so I haven't seen it do anything, but if the bulk daemon works, the msearch one should work similarly.

Mentioned in SAL (#wikimedia-operations) [2023-11-13T21:46:19Z] <inflatador> bking@deploy2002 deploy mjolnir 2.4.0 on newly-built bullseye hosts T346039

Per the last deploy message above, it looks like mjolnir is running successfully under Bullseye and Python 3.10. The next step is to decom the older, buster-based VMs. That work is tracked in T351123.

Change 974211 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Apply Puppet 7 on the role level

https://gerrit.wikimedia.org/r/974211

Change 974211 merged by Muehlenhoff:

[operations/puppet@production] Apply Puppet 7 on the role level

https://gerrit.wikimedia.org/r/974211