User Details
- User Since: May 4 2022, 6:41 PM (206 w, 4 d)
- Availability: Available
- IRC Nick: brett
- LDAP User: BCornwall
- MediaWiki User: BCornwall-WMF
Wed, Apr 15
I can confirm it's behaving properly now! Reimage worked just fine and I don't have any kernel errors any more. It's been put into service. Thanks, @VRiley-WMF !
Tue, Apr 14
We've decided to move forward with this task. Would dcops be willing to handle the NIC revert in lvs1017?
Mon, Apr 13
Hi, @JKelsoteel-WMF! This has been deployed, so I'm going to go ahead and close this. Please do re-open if something is not as expected. Thanks!
Thu, Apr 9
Thanks for the response, @elukey! Indeed, Icinga would ideally not even be used any more. Since the service in question is slated to be replaced in the upcoming months, it's not worth the porting effort. However, that is a good response: "Why are you using this dead alerting system in the first place? Migrate over to Prometheus/AM".
Wed, Apr 8
This highlights the larger problem of the opacity of cookbooks, particularly those that purport to be generalized. Removing downtimes not related to its own operation is overreaching and IMHO the solution is for it to remove only its own downtime.
Fri, Apr 3
We're going to discuss whether we still want to pursue this; sorry for the premature bug report. We'll probably discuss it next Tuesday in our sync-up.
No problem. I just re-ran it and can confirm that the issues are still present.
Tue, Mar 31
The LVS service has been removed, the hosts decommissioned, and the hcaptcha_proxy module removed from puppet. I'm not sure that anything needs to happen with the Grafana dashboard. I see that there's an exclusion of the old hcaptcha hosts in the dashboard variables but otherwise don't see any remnants at first glance.
Mon, Mar 30
@VRiley-WMF Do I need to check things again?
Thu, Mar 26
It's been consistent behavior for some weeks now - both downtimes are removed at once after the reboot occurs. Not sure if this is a cookbook issue or an icinga issue, given the context of this Icinga bug.
Wed, Mar 25
Unsure if this is related to this particular issue, but running sre.hosts.downtime and then sre.hosts.reboot-single causes the downtime to be removed, so alerts are still triggered.
The information on the page was not accurate and has been removed.
Tue, Mar 24
Unfortunately, it's still throwing the errors. :(
@VRiley-WMF Yes, please do!
Marked cp1115 as "failed" in netbox
Hm. The specific output:
Mar 12 2026
Thank you for all your work, rob. I was able to reimage and all seems well now. I'll re-open this if anything changes.