Hi folks!
I have seen analytics10[63,67]'s mgmt interfaces flapping in icinga recently, and sometimes connectivity gets lost for ~1h. Could you please check whether we have faulty cables attached to those?
@elukey The idracs on analytics1063 and analytics1067 are stuck, and each server needs to be physically powered off and unplugged for 20-30 secs.
Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:12Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1063.eqiad.wmnet with reason: T303151
Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:16Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1063.eqiad.wmnet with reason: T303151
Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:20Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1067.eqiad.wmnet with reason: T303151
Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:23Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1067.eqiad.wmnet with reason: T303151
Mentioned in SAL (#wikimedia-analytics) [2022-03-17T14:07:33Z] <btullis> shutdown analytics1063 and analytics1067 with 120 minutes of downtime T303151
I am able to get into the idrac for both servers, though it does take a little longer than normal. I also attempted to replace the cable, but that did not improve the link speed.
Thanks @Cmjohnson - I guess we'll just keep monitoring for stability and reopen this ticket if it keeps happening.