The following hosts are still on Buster and need to be migrated ASAP:
search-loader1001.eqiad.wmnet
search-loader2001.codfw.wmnet
Creating this ticket to start the process.
Status | Assigned | Task
---|---|---
Open | None | T291916 Tracking task for Bullseye migrations in production
Resolved | Gehel | T323921 [Epic] Migrate all Search Platform servers to Debian Bullseye
Resolved | bking | T346039 Migrate search-loader hosts to Bullseye or later
Resolved | Gehel | T346272 1 codfw VM requested for search-loader
Resolved | bking | T346273 eqiad: 1 VM requested for search-loader
Resolved | EBernhardson | T346373 Ensure mjolnir can work on Python 3.9 or later
Invalid | None | T350078 Decom search-loader VMs still using Buster
Resolved | bking | T351123 Decommission search-loader1001/2001 VMs
Invalid | bking | T351233 Update search-loader dashboard to reflect new search-loader hosts
@EBernhardson @dcausse as far as replacing these hosts:
The apifeatureusage hosts are logstash hosts, so it might be better to ask the o11y team for advice here. Regarding the search-loader hosts, @EBernhardson might know best, but since they should run asynchronous processes and are stateless, I suspect that the upgrade should be straightforward and not too risky.
Indeed, the search-loader changes should be straightforward. All their state is stored in Kafka; if they get restarted, they pick back up after the last confirmed message.
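The restart behaviour described above can be illustrated with a small in-memory simulation. This is only a sketch of the general "resume from the last confirmed offset" pattern, not mjolnir's actual code: `FakePartition` and `StatelessWorker` are hypothetical names, and no real Kafka client is used.

```python
from dataclasses import dataclass, field


@dataclass
class FakePartition:
    """In-memory stand-in for one Kafka partition plus its committed offset."""
    messages: list
    committed: int = 0  # offset of the next message to read after a restart


@dataclass
class StatelessWorker:
    """Hypothetical worker with no local state: it resumes from the committed offset."""
    partition: FakePartition
    processed: list = field(default_factory=list)

    def run(self, crash_after=None):
        offset = self.partition.committed
        handled = 0
        while offset < len(self.partition.messages):
            if crash_after is not None and handled == crash_after:
                return  # simulate a restart before the next message is read
            self.processed.append(self.partition.messages[offset])
            offset += 1
            handled += 1
            self.partition.committed = offset  # confirm only after processing


partition = FakePartition(messages=["a", "b", "c", "d"])
first = StatelessWorker(partition)
first.run(crash_after=2)             # handles "a" and "b", then stops
second = StatelessWorker(partition)  # e.g. a freshly built replacement host
second.run()                         # picks up at "c"
```

Because the only durable state is the committed offset, it does not matter whether the second worker runs on the same host or on a new one.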
JFTR, both of these roles are Ganeti VMs, so another path would be to create apifeatureusage[12]002/search-loader[12]002, test and migrate these in parallel to the current production instances, and eventually decom the old [12]001 VMs.
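For readers unfamiliar with the `[12]` shorthand, the following sketch expands it into the individual host prefixes. `expand_char_classes` is a hypothetical helper written for illustration; it only handles single-character classes such as `[12]`, not the range form (`[2001-2002]`) that appears later in the cookbook log entries.

```python
import re
from itertools import product


def expand_char_classes(pattern: str) -> list[str]:
    """Expand bracketed character classes, e.g. 'host[12]002' -> host1002, host2002."""
    # re.split with a capturing group alternates literal text (even indexes)
    # and the contents of each [...] class (odd indexes).
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [list(p) if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


new_hosts = expand_char_classes("search-loader[12]002")
old_hosts = expand_char_classes("search-loader[12]001")
```

Here the first digit distinguishes the two sites seen in this task: `1` for eqiad and `2` for codfw.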
Change 957336 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] site.pp: add new search-loader hostnames
Change 957336 merged by Bking:
[operations/puppet@production] site.pp: add new search-loader hostnames
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host search-loader2002.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host search-loader1002.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host search-loader2002.codfw.wmnet with OS bullseye completed:
Change 957762 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] search-loader: Move new VMs into prod role
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host search-loader1002.eqiad.wmnet with OS bullseye completed:
Change 957762 merged by Bking:
[operations/puppet@production] search-loader: Move new VMs into prod role
Mentioned in SAL (#wikimedia-operations) [2023-09-14T17:05:09Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on search-loader2002.codfw.wmnet,search-loader1002.eqiad.wmnet with reason: T346039
Mentioned in SAL (#wikimedia-operations) [2023-09-14T17:05:23Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on search-loader2002.codfw.wmnet,search-loader1002.eqiad.wmnet with reason: T346039
Found that task when investigating something else.
search-loader2002: The last Puppet run was at Tue Sep 26 06:58:45 UTC 2023 (24831 minutes ago).
It's not really recommended to have a host out of Puppet for that long; it seems like it even got evicted from PuppetDB because of inactivity. What's the ETA to have it pick up new Puppet changes? Ideally through a re-image.
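To put the "24831 minutes ago" figure in perspective, this short calculation converts it to days and derives when the check was taken. It uses only the timestamp and minute count quoted in the alert above; nothing else is assumed.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the alert text above.
last_run = datetime(2023, 9, 26, 6, 58, 45, tzinfo=timezone.utc)
age = timedelta(minutes=24831)

# When the alert was evaluated, and how long Puppet had been idle.
checked_at = last_run + age
days, remainder = divmod(age, timedelta(days=1))

print(f"Puppet idle for {days} days and {remainder}, as of {checked_at.isoformat()}")
```

That works out to a little over 17 days without a Puppet run.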
@ayounsi sorry for the trouble; this is not a production host. I will move it back to insetup and run Puppet.
Change 965748 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] search-loader: Move bullseye hosts back to insetup
Change 965748 merged by Bking:
[operations/puppet@production] search-loader: Move bullseye hosts back to insetup
Change 969386 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] search-loader: use default system python
Change 969386 merged by Bking:
[operations/puppet@production] search-loader: use default system python
Change 969754 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] search-loader: Bring new hosts into service
Change 969754 abandoned by Bking:
[operations/puppet@production] search-loader: Bring new hosts into service
Reason:
Change 969761 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] search-loader: Bring new hosts into service
Change 969761 merged by Bking:
[operations/puppet@production] search-loader: Bring new hosts into service
Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:36:52Z] <inflatador> bking@search-loader2001 disabling services as part of bullseye migration T346039
Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:37:52Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader2001.codfw.wmnet with reason: T346039
Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:38:17Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader2001.codfw.wmnet with reason: T346039
Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:41:13Z] <bking@deploy2002> Started deploy [search/mjolnir/deploy@daf8c32]: T346039
Mentioned in SAL (#wikimedia-operations) [2023-10-30T14:41:29Z] <bking@deploy2002> Finished deploy [search/mjolnir/deploy@daf8c32]: T346039 (duration: 00m 05s)
Mentioned in SAL (#wikimedia-operations) [2023-10-30T18:16:06Z] <bking@deploy2002> Started deploy [search/mjolnir/deploy@daf8c32]: T346039
Mentioned in SAL (#wikimedia-operations) [2023-10-30T18:16:24Z] <bking@deploy2002> Finished deploy [search/mjolnir/deploy@daf8c32]: T346039 (duration: 00m 06s)
Mentioned in SAL (#wikimedia-operations) [2023-10-30T18:19:45Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader[2001-2002].codfw.wmnet with reason: T346039
Mentioned in SAL (#wikimedia-operations) [2023-10-30T18:20:10Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader[2001-2002].codfw.wmnet with reason: T346039
Current state of operations:
I've suppressed alerts for all 4 search-loader hosts for the next day. I have deliberately NOT run puppet on search-loader1001 just in case we have to roll back.
Puppet is failing on the Bullseye hosts search-loader[12]002 because it is looking for a package named libpython, whereas the package is actually called libpython-${python_minor_vers}.
The package is automatically installed as a dependency when the generic python3 package is installed, so we do not need to specify it in the puppet code. I'll work on a patch now.
Change 969947 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] search-loader: removed unneeded package dep
Change 969947 merged by Bking:
[operations/puppet@production] search-loader: removed unneeded package dep
Change 969968 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] kafka-jumbo: permit traffic from new search-loader VMs
Change 969968 merged by Bking:
[operations/puppet@production] kafka-jumbo: permit traffic from new search-loader VMs
After the firewall patch directly above this comment, it appears that the connection errors for search-loader1002 and 2002 have disappeared.
As such, I have started Puppet and all services on all search-loader VMs, and I've also removed all downtimes. I'm moving this ticket to "Needs Review" so @EBernhardson and/or the Search Platform team can confirm that everything is working.
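A firewall change like the one above can be spot-checked from a client host with a plain TCP connection test. The sketch below is generic: `can_connect` is a hypothetical helper, and the broker hostname and port in the usage comment are placeholders, not values taken from this task.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (placeholder host and port; substitute the real broker address):
# can_connect("kafka-broker.example", 9092)
```

A successful TCP connect only shows that the port is reachable; it does not confirm that the Kafka client itself can authenticate or consume.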
Per conversation with @Gehel, we might need to do a scream test by shutting off the old instances. I will look into this next week.
The relevant graphs seem reasonable; it looks like work has transferred over to the new instances. The msearch daemon only has work to perform on Fridays, so I haven't seen it do anything yet, but if the bulk daemon works, the msearch one should work similarly.
Mentioned in SAL (#wikimedia-operations) [2023-11-13T21:46:19Z] <inflatador> bking@deploy2002 deploy mjolnir 2.4.0 on newly-built bullseye hosts T346039
Per the last deploy message above, it looks like mjolnir is running successfully under Bullseye and Python 3.9. The next step is to decom the older, Buster-based VMs. That work is tracked in T351123.
Change 974211 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] Apply Puppet 7 on the role level
Change 974211 merged by Muehlenhoff:
[operations/puppet@production] Apply Puppet 7 on the role level