
Migrate Cloudelastic to OpenSearch 2.x
Closed, Resolved (Public)

Description

Since our usual test cluster (relforge) is occupied by a Semantic Search experiment (ref T413969), we will start our OpenSearch 2.x migration on the cloudelastic cluster. Creating this ticket to:

  • Update the cluster from OpenSearch 1.x to 2.x
  • Document any lessons learned: Added docs on our use of systemd for plugin management

Details

Other Assignee
RKemper
Related Changes in Gerrit:
Repo               Branch      Lines +/-
operations/puppet  production  +0 -4
operations/puppet  production  +4 -2
operations/puppet  production  +2 -0
operations/puppet  production  +22 -80
operations/puppet  production  +2 -0
operations/puppet  production  +2 -2
operations/puppet  production  +6 -6
operations/puppet  production  +15 -0
operations/puppet  production  +16 -8
operations/puppet  production  +56 -1
operations/puppet  production  +2 -1
operations/puppet  production  +1 -5
operations/puppet  production  +5 -1
operations/puppet  production  +0 -297
operations/puppet  production  +13 -0
operations/puppet  production  +11 -4
operations/puppet  production  +17 -2
operations/puppet  production  +17 -2
operations/puppet  production  +1 -1
operations/puppet  production  +4 -2
operations/puppet  production  +29 -0
operations/puppet  production  +25 -0
operations/puppet  production  +54 -13
operations/puppet  production  +17 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +20 -0
operations/puppet  production  +0 -1
operations/puppet  production  +0 -1
operations/puppet  production  +1 -5
operations/puppet  production  +6 -2
operations/puppet  production  +32 -4

Event Timeline


Change #1270511 merged by Bking:

[operations/puppet@production] opensearch: hack around upstream 2.x+ packages

https://gerrit.wikimedia.org/r/1270511

Change #1270953 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] opensearch: correct o11y usage in comment

https://gerrit.wikimedia.org/r/1270953

Change #1270953 merged by Cwhite:

[operations/puppet@production] opensearch: correct o11y usage in comment

https://gerrit.wikimedia.org/r/1270953

Change #1271473 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] opensearch: strip bundled plugins before WMF pkg

https://gerrit.wikimedia.org/r/1271473

@bking Regarding the broken import of the madvise package: I went ahead and rebuilt it as 0.2+deb13u1. Although that package's only dependency is on glibc, which has a stable ABI, it's still preferable to rebuild it with GCC 15 from Trixie. The new version also resolves the versioning/import issue. I've synced the debs to my home directory on cloudelastic1012 but haven't installed them yet, since I didn't want to interfere with any of your ongoing tests. When the time is right, please install them on 1012, and if they are fine, I'll import them to apt.w.o.

@MoritzMuehlenhoff, I've installed the packages as you requested and I can confirm they installed cleanly. Feel free to publish them to the repos.

Thanks for your help!

Change #1271473 merged by Bking:

[operations/puppet@production] opensearch: strip bundled plugins before WMF pkg

https://gerrit.wikimedia.org/r/1271473

Change #1271818 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: fix java path typo

https://gerrit.wikimedia.org/r/1271818

Change #1271818 merged by Bking:

[operations/puppet@production] cloudelastic: fix java path typo

https://gerrit.wikimedia.org/r/1271818

Icinga downtime and Alertmanager silence (ID=396a17ce-b27d-41be-a6ce-921c607989da) set by bking@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: still fixing Puppet

cloudelastic1012.eqiad.wmnet

Change #1271929 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: temporarily add "working typos" for plugins

https://gerrit.wikimedia.org/r/1271929

Change #1271929 merged by Bking:

[operations/puppet@production] cloudelastic: temporarily add "working typos" for plugins

https://gerrit.wikimedia.org/r/1271929

Change #1271947 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] opensearch: allowlist upstream-only plugins

https://gerrit.wikimedia.org/r/1271947

Mentioned in SAL (#wikimedia-operations) [2026-04-16T06:55:17Z] <moritzm> imported opensearch-madvise 0.2+deb13u1 to component/opensearch2 of trixie-wikimedia T422860

@MoritzMuehlenhoff, I've installed the packages as you requested and I can confirm they installed cleanly. Feel free to publish them to the repos.

Nice! I've imported the new package into component/opensearch2 for trixie-wikimedia

Change #1273887 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] OpenSearch: Control which plugins we use via systemd PrivateMounts

https://gerrit.wikimedia.org/r/1273887

Change #1273887 merged by Bking:

[operations/puppet@production] OpenSearch: Control which plugins we use via systemd PrivateMounts

https://gerrit.wikimedia.org/r/1273887

Change #1273937 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] opensearch: move var up so we can use it earlier

https://gerrit.wikimedia.org/r/1273937

Change #1273937 merged by Bking:

[operations/puppet@production] opensearch: move var up so we can use it earlier

https://gerrit.wikimedia.org/r/1273937

Change #1273943 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic1012: remove the deliberately-introduced typo

https://gerrit.wikimedia.org/r/1273943

Change #1273943 merged by Bking:

[operations/puppet@production] cloudelastic1012: remove the deliberately-introduced typo

https://gerrit.wikimedia.org/r/1273943

Change #1274061 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic1012: override common_settings merge to first

https://gerrit.wikimedia.org/r/1274061

Change #1274075 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic1012: full common_settings override for OS2

https://gerrit.wikimedia.org/r/1274075

Change #1274075 abandoned by Ryan Kemper:

[operations/puppet@production] cloudelastic1012: full common_settings override for OS2

Reason:

meant to update https://gerrit.wikimedia.org/r/c/operations/puppet/+/1274061; abandoning

https://gerrit.wikimedia.org/r/1274075

Change #1274061 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic1012: full common_settings override for OS2

https://gerrit.wikimedia.org/r/1274061

Change #1274134 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic1012: full common_settings override for OS2

https://gerrit.wikimedia.org/r/1274134

Change #1274134 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic1012: full common_settings override for OS2

https://gerrit.wikimedia.org/r/1274134

Change #1275435 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic1012: Set LVS config for opensearch_2

https://gerrit.wikimedia.org/r/1275435

Change #1275435 merged by Bking:

[operations/puppet@production] cloudelastic1012: Set LVS config for opensearch_2

https://gerrit.wikimedia.org/r/1275435

Change #1275444 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] Cirrussearch: remove unused hiera files

https://gerrit.wikimedia.org/r/1275444

Change #1275444 merged by Bking:

[operations/puppet@production] Cirrussearch: remove unused hiera files

https://gerrit.wikimedia.org/r/1275444

Change #1275473 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic1012: move back to insetup

https://gerrit.wikimedia.org/r/1275473

Change #1275473 merged by Bking:

[operations/puppet@production] cloudelastic1012: move back to insetup

https://gerrit.wikimedia.org/r/1275473

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS trixie completed:

  • cloudelastic1012 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604201718_bking_545568_cloudelastic1012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1275485 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic1012: move back to production role

https://gerrit.wikimedia.org/r/1275485

Change #1275485 merged by Bking:

[operations/puppet@production] cloudelastic1012: move back to production role

https://gerrit.wikimedia.org/r/1275485

After reimaging cloudelastic1012, it is back up and ready for testing.

I've removed it from load balancer rotation, shut off Puppet, and have stopped all instances except psi (port 9600), which I'm using as our guinea pig.

So far, I've gotten the following errors:

Caused by: java.lang.IllegalStateException: index [.ltrstore/vCo9DZu5Qt-3QbtmBy1d7Q] version not supported: 6.5.4 minimum compatible index version is: 7.

This index is part of OpenSearch's machine learning/Learning to Rank feature set. It is not used in cloudelastic, but for production we may have to do something like*:

  • create a new named store
  • reload the data (maybe via reindex api, needs testing)
  • repoint queries at the new feature store
  • get rid of the old one
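
The steps above could be written down as an ordered request plan before anything is run. This is only a sketch: it assumes the Learning to Rank plugin's `_ltr/<name>` store endpoints and its `.ltrstore_<name>` index naming for named stores, and it assumes a plain `_reindex` is enough to copy the store contents, which is untested here. The function name and the `os2` store name are hypothetical. Nothing is sent to a cluster; the function only returns the plan.

```python
def ltr_store_migration_plan(new_store: str, old_index: str = ".ltrstore") -> list[dict]:
    """Return the migration steps in order, without sending any request."""
    # Assumption: named LTR stores live in an index called ".ltrstore_<name>".
    new_index = f".ltrstore_{new_store}"
    return [
        # 1. Create a new named feature store.
        {"step": "create named store", "method": "PUT",
         "path": f"/_ltr/{new_store}", "body": None},
        # 2. Copy the old store's documents across (needs testing).
        {"step": "reload data", "method": "POST", "path": "/_reindex",
         "body": {"source": {"index": old_index}, "dest": {"index": new_index}}},
        # 3. Not an HTTP call: a search-side config change to use the new store.
        {"step": "repoint queries", "method": None, "path": None, "body": None},
        # 4. Drop the old default store once nothing reads from it.
        {"step": "delete old store", "method": "DELETE", "path": "/_ltr", "body": None},
    ]
```

Printing each entry of `ltr_store_migration_plan("os2")` gives a checklist that can be reviewed before any request is issued by hand.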

The next error I've seen is very similar:

java.lang.IllegalStateException: index [mw_cirrus_metastore_1659365741/ugKwuXOpRjiti8dY67m9OA] version not supported: 6.8.23 minimum compatibl

Per codesearch, CirrusSearch (the MediaWiki extension that provides OpenSearch support) uses the mw_cirrus_metastore index to store the state of administrative tasks. We're still working out a plan to upgrade this index gracefully as I write this.

*suggested by @EBernhardson in Wikimedia-Search IRC

We had a few more indices to delete before the existing OpenSearch 1.x clusters would allow an OpenSearch 2 node to join. We can find the problem indices with this one-liner:

curl -s localhost:${PORT}/_all/_settings | jq -r 'to_entries[] | "\(.key) \(.value.settings.index.version.created)"' | grep -v 135249827
(135249827 means the index was created on OpenSearch 1; anything not matching that will be a problem.)
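
The same check can be sketched in Python, with the version IDs decoded into readable version strings. This is a minimal sketch, not part of our tooling: the function names are hypothetical, and the decoding assumes the upstream scheme in which OpenSearch releases set the 0x08000000 bit in the ID while legacy Elasticsearch IDs do not (under that assumption 135249827 decodes to 1.3.20). The input is the parsed JSON returned by `/_all/_settings`.

```python
# Assumption: OpenSearch marks its own version IDs with this bit; legacy
# Elasticsearch IDs (6.x, 7.x) leave it unset.
OPENSEARCH_MASK = 0x08000000


def decode_version(version_id: int) -> str:
    """Turn an index.version.created ID into a dotted version string."""
    raw = version_id & ~OPENSEARCH_MASK  # clear the OpenSearch marker bit
    major = raw // 1_000_000
    minor = raw // 10_000 % 100
    revision = raw // 100 % 100
    return f"{major}.{minor}.{revision}"


def find_problem_indices(all_settings: dict,
                         ok_version_id: str = "135249827") -> list[tuple[str, str]]:
    """List (index name, creation version) for indices not created on the OK version."""
    problems = []
    for name, body in sorted(all_settings.items()):
        # Same field the jq one-liner reads; the API returns it as a string.
        created = body["settings"]["index"]["version"]["created"]
        if created != ok_version_id:
            problems.append((name, decode_version(int(created))))
    return problems
```

Like the `grep -v` above, this flags anything that does not match the expected ID exactly, so an index created on a different OpenSearch 1.x release would also be listed and would need a manual look.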

The problem indices for Cloudelastic were:

  • .ltrstore as described above
  • mw_cirrus_metastore also described above
  • .tasks - used internally by OpenSearch to keep track of running tasks. Safe enough to delete in most circumstances (if you just lost a bunch of data and were waiting for OpenSearch to recreate shards, probably not).

We will have to be a bit more cautious for the production clusters, but I think we just need a few more reimages to get Cloudelastic onto OpenSearch 2.x.

Change #1275535 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] prometheus: fix wmf-elasticsearch-exporter listen address on Trixie

https://gerrit.wikimedia.org/r/1275535

Change #1275535 merged by Ryan Kemper:

[operations/puppet@production] prometheus: fix wmf-elasticsearch-exporter listen address on Trixie

https://gerrit.wikimedia.org/r/1275535

Note that we also ran into a problem with the Prometheus exporter and Python 3.13, which comes with Trixie. @RKemper's patch above fixes that.

Change #1276804 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2

https://gerrit.wikimedia.org/r/1276804

Change #1276804 merged by Bking:

[operations/puppet@production] cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2

https://gerrit.wikimedia.org/r/1276804

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1011.eqiad.wmnet with OS trixie

Change #1276818 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: set role-level hiera for OpenSearch 2/Trixie

https://gerrit.wikimedia.org/r/1276818

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1011.eqiad.wmnet with OS trixie completed:

  • cloudelastic1011 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604232154_bking_3545146_cloudelastic1011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.eqiad.wmnet with OS trixie executed with errors:

  • cloudelastic1010 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudelastic1010.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Change #1277180 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] opensearch 1: parameterize disk threshold values

https://gerrit.wikimedia.org/r/1277180

Change #1277180 merged by Bking:

[operations/puppet@production] opensearch 1: parameterize disk threshold values/up limits in ce

https://gerrit.wikimedia.org/r/1277180

Change #1277194 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: add missing extra-analysis plugins

https://gerrit.wikimedia.org/r/1277194

Change #1277194 merged by Bking:

[operations/puppet@production] cloudelastic: add missing extra-analysis plugins

https://gerrit.wikimedia.org/r/1277194

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.eqiad.wmnet with OS trixie

Change #1276818 merged by Bking:

[operations/puppet@production] cloudelastic: set role-level hiera for OpenSearch 2/Trixie

https://gerrit.wikimedia.org/r/1276818

We're making progress! I'm still seeing Puppet failures around the installation of the opensearch-madvise package. The package seems to work when installed manually, so I'll take a look at the Puppet code next.

Change #1277614 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: update allowed plugins list

https://gerrit.wikimedia.org/r/1277614

Change #1277614 merged by Bking:

[operations/puppet@production] cloudelastic: update allowed plugins list

https://gerrit.wikimedia.org/r/1277614

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.eqiad.wmnet with OS trixie completed:

  • cloudelastic1010 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604271531_bking_2772828_cloudelastic1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change #1277640 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: Another required plugins update patch

https://gerrit.wikimedia.org/r/1277640

Change #1277640 merged by Bking:

[operations/puppet@production] cloudelastic: Another required plugins update patch

https://gerrit.wikimedia.org/r/1277640

Change #1277687 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: Another ugly plugins patch

https://gerrit.wikimedia.org/r/1277687

Change #1277687 merged by Bking:

[operations/puppet@production] cloudelastic: Another ugly plugins patch

https://gerrit.wikimedia.org/r/1277687

Change #1277692 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: Add back the opensearch-ltr plugin

https://gerrit.wikimedia.org/r/1277692

Change #1277692 merged by Bking:

[operations/puppet@production] cloudelastic: Add back the opensearch-ltr plugin

https://gerrit.wikimedia.org/r/1277692

Change #1277704 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: remove a plugin path that doesn't exist

https://gerrit.wikimedia.org/r/1277704

Change #1277704 merged by Bking:

[operations/puppet@production] cloudelastic: remove a plugin path that doesn't exist

https://gerrit.wikimedia.org/r/1277704

Change #1277708 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: get rid of merge overrides

https://gerrit.wikimedia.org/r/1277708

Change #1277708 merged by Bking:

[operations/puppet@production] cloudelastic: get rid of merge overrides

https://gerrit.wikimedia.org/r/1277708

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1009.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1009.eqiad.wmnet with OS trixie completed:

  • cloudelastic1009 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604282120_bking_3918437_cloudelastic1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS trixie completed:

  • cloudelastic1008 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cloudelastic1008.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl"}
{"cloudelastic1008.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl-public"}
{"cloudelastic1008.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl"}
{"cloudelastic1008.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl-public"}
{"cloudelastic1008.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl"}
{"cloudelastic1008.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl-public"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604290255_ryankemper_4135350_cloudelastic1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=cloudelastic1008\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-omega\-ssl' set/pooled=no
sudo confctl select 'name=cloudelastic1008\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-omega\-ssl\-public' set/pooled=no
sudo confctl select 'name=cloudelastic1008\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-psi\-ssl' set/pooled=no
sudo confctl select 'name=cloudelastic1008\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-psi\-ssl\-public' set/pooled=no
sudo confctl select 'name=cloudelastic1008\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-chi\-ssl' set/pooled=no
sudo confctl select 'name=cloudelastic1008\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-chi\-ssl\-public' set/pooled=no

    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host cloudelastic1007.eqiad.wmnet with OS trixie

Mentioned in SAL (#wikimedia-operations) [2026-04-29T07:39:28Z] <ryankemper> T422860 [cloudelastic] Restarted opensearch services on cloudelastic1011 and cloudelastic1012 (needed to pick up missing opensearch plugins, which have already been fixed in puppet) (note: this was done ~2h ago; logged in wrong channel)

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host cloudelastic1007.eqiad.wmnet with OS trixie executed with errors:

  • cloudelastic1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cloudelastic1007.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl"}
{"cloudelastic1007.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl-public"}
{"cloudelastic1007.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl"}
{"cloudelastic1007.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl-public"}
{"cloudelastic1007.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl"}
{"cloudelastic1007.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl-public"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=cloudelastic1007\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-psi\-ssl' set/pooled=no
sudo confctl select 'name=cloudelastic1007\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-psi\-ssl\-public' set/pooled=no
sudo confctl select 'name=cloudelastic1007\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-chi\-ssl' set/pooled=no
sudo confctl select 'name=cloudelastic1007\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-chi\-ssl\-public' set/pooled=no
sudo confctl select 'name=cloudelastic1007\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-omega\-ssl' set/pooled=no
sudo confctl select 'name=cloudelastic1007\.eqiad\.wmnet,dc=eqiad,cluster=cloudelastic,service=cloudelastic\-omega\-ssl\-public' set/pooled=no

    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudelastic1007.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1007.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1007.eqiad.wmnet with OS trixie completed:

  • cloudelastic1007 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604291359_bking_385235_cloudelastic1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

All Cloudelastic hosts are on OpenSearch 2.x/Debian Trixie. We'll focus on the production clusters as time permits; see parent ticket for further updates.

bking updated the task description.