Creating this ticket to:
- Bring hosts elastic2087-2109 into service: 5 net-new hosts, 18 refresh
- Decom elastic20[37-54]
Puppet code to enable Puppet 7 on these new hosts was added here
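As a quick sanity check on the host counts above, the two range notations used in this ticket can be expanded and compared. This is only a sketch: `expand_hosts` is a hypothetical helper written for this illustration, not part of any existing tooling, and it handles only the two notations that appear here.

```python
import re

def expand_hosts(spec: str) -> list[str]:
    """Expand 'elastic20[37-54]' or 'elastic2087-2109' into individual host names."""
    # Bracket form: prefix followed by [start-end], e.g. elastic20[37-54]
    m = re.fullmatch(r"([a-z]+\d*)\[(\d+)-(\d+)\]", spec)
    if m is None:
        # Plain form: name + start number, '-', end number, e.g. elastic2087-2109
        m = re.fullmatch(r"([a-z]+)(\d+)-(\d+)", spec)
    if m is None:
        raise ValueError(f"unrecognised host range: {spec!r}")
    prefix, start, end = m.group(1), m.group(2), m.group(3)
    width = len(start)  # keep zero padding if the range uses it
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]

new_hosts = expand_hosts("elastic2087-2109")
old_hosts = expand_hosts("elastic20[37-54]")

# 23 hosts arrive and 18 are decommissioned, which leaves 5 net-new hosts.
print(len(new_hosts), len(old_hosts), len(new_hosts) - len(old_hosts))
```

The difference between the two counts agrees with the "5 net-new hosts, 18 refresh" split stated in the description.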
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T353392 Ensure Elastic stack works on bookworm
Resolved | | bking | T353878 Service implementation for elastic2087-2109
Resolved | | BTullis | T355830 Hardware error on elastic2094 - Comm Error: Backplane 0.
Resolved | | bking | T358882 Decommission elastic2037-2054
Resolved | Request | bking | T313842 Decommission elastic2049.codfw.wmnet
Resolved | Request | Jhancock.wm | T361305 decommission elastic20[37-54].codfw.wmnet
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2096.codfw.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2097.codfw.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1104.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1105.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1106.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1104.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1105.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1106.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1107.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1107.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2104.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2106.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2105.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2106.codfw.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2104.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2105.codfw.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye completed:
@bking elastic2088 is now ready for the next step.
elastic2094 is still showing an error and needs further investigation.
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS bullseye completed:
Change 1007969 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] elastic: add elastic2088-2109 to production role
Change 1007969 merged by Bking:
[operations/puppet@production] elastic: add elastic2088-2109 to production role
I added elastic2088-2109 to the production roles and ran puppet, however:
Change 1008528 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] elastic: move elastic2107 back to insetup
Change 1008528 merged by Bking:
[operations/puppet@production] elastic: move elastic2107 and 2108 back to insetup
Change #1013395 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] elastic: Bring elastic2107/2108 into service
Change #1013395 merged by Bking:
[operations/puppet@production] elastic: Bring elastic2107/2108 into service
Change #1013398 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] elastic-codfw: Add new master-eligibles
Change #1013398 merged by Bking:
[operations/puppet@production] elastic-codfw: Add new master-eligibles
Mentioned in SAL (#wikimedia-operations) [2024-03-21T20:35:10Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
Mentioned in SAL (#wikimedia-operations) [2024-03-21T20:37:05Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
Mentioned in SAL (#wikimedia-operations) [2024-03-21T21:03:59Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
Mentioned in SAL (#wikimedia-operations) [2024-03-21T22:39:19Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
elastic2088 is unreachable and is reported as missing from PuppetDB by the Netbox report. No host should be powered on with Puppet disabled or not working for a long period of time. Please either reimage it now, or shut it down now and reimage it at a later stage (before powering it back on).
I think it wasn't logged to this ticket, but we tried kicking off a reimage of elastic2088 yesterday. From SAL:
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host elastic2088.codfw.wmnet with OS bullseye
Will likely need to open a ticket with dc-ops. For now, I've powered it off through the DRAC via serveraction powerdown.
Mentioned in SAL (#wikimedia-operations) [2024-03-28T19:48:56Z] <ryankemper> T353878 Updated cross cluster remote seed conf with latest master info: ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9443/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
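For readers unfamiliar with cross-cluster seeds: Elasticsearch stores them under `cluster.remote.<alias>.seeds` in the cluster settings. The sketch below shows roughly what a request body for `_cluster/settings` could look like. It is not the contents of `push_cross_cluster_conf.py`, which is not shown in this ticket; the helper names, the use of `persistent` settings, and the seed address are all assumptions made for illustration.

```python
import json

def parse_ccc(pairs: list[str]) -> dict[str, str]:
    """Split 'alias=path' arguments (the shape passed to --ccc above) into {alias: path}."""
    result = {}
    for pair in pairs:
        alias, sep, path = pair.partition("=")
        if not sep or not alias or not path:
            raise ValueError(f"expected alias=path, got {pair!r}")
        result[alias] = path
    return result

def build_remote_seeds(seeds_by_alias: dict[str, list[str]]) -> dict:
    """Build a _cluster/settings body that sets cluster.remote.<alias>.seeds."""
    # Assumption: persistent settings; the real script may use transient ones.
    return {
        "persistent": {
            f"cluster.remote.{alias}.seeds": seeds
            for alias, seeds in seeds_by_alias.items()
        }
    }

files = parse_ccc([
    "chi=chi_codfw_masters.lst",
    "psi=psi_codfw_masters.lst",
    "omega=omega_codfw_masters.lst",
])

# Hypothetical seed entry; the real values would come from the .lst files.
body = build_remote_seeds({"chi": ["elastic2107.codfw.wmnet:9300"]})
print(json.dumps(body, indent=2))
```

The body would be sent with a PUT to the `_cluster/settings` endpoint; the sketch stops short of making the request so that it can be run without cluster access.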
Change #1015379 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] elasticsearch: remove elastic2090 from psi cluster
Change #1015381 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] elastic: move elastic2088 to insetup
Change #1015381 merged by Bking:
[operations/puppet@production] elastic: move elastic2088 to insetup
Change #1015379 merged by Bking:
[operations/puppet@production] elasticsearch: remove elastic2090 from psi cluster
Mentioned in SAL (#wikimedia-operations) [2024-03-28T20:07:47Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878
Mentioned in SAL (#wikimedia-operations) [2024-03-28T20:07:53Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878
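One common way to drain shards off a node before a reimage is Elasticsearch's allocation-exclude setting, which accepts comma-separated name patterns with wildcards. Whether the `sre.elasticsearch.ban` cookbook does exactly this is an assumption; the sketch below only illustrates the general mechanism, and `add_ban` is a hypothetical helper.

```python
def add_ban(existing: str, pattern: str) -> dict:
    """Return a _cluster/settings body excluding `pattern` in addition to `existing`.

    `existing` is the current comma-separated value of the exclude setting
    (empty string if nothing is banned yet).
    """
    names = [n for n in existing.split(",") if n]
    if pattern not in names:
        names.append(pattern)
    # Assumption: a transient setting keyed on node name.
    return {
        "transient": {
            "cluster.routing.allocation.exclude._name": ",".join(names)
        }
    }

# Same pattern as in the SAL entry above.
print(add_ban("", "elastic2090*"))
```

Merging into the existing value rather than overwriting it keeps any earlier bans in place.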
All hosts in scope for implementation are now part of our production elastic cluster, EXCEPT elastic2088, which has hardware problems (tracked in T361525). Closing...
Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:50:06Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2090 for reboot to get rid of broken systemd units - bking@cumin2002 - T353878
Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:50:10Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2090 for reboot to get rid of broken systemd units - bking@cumin2002 - T353878
Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:51:44Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on elastic2090.codfw.wmnet with reason: T353878
Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:51:59Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on elastic2090.codfw.wmnet with reason: T353878