analytics10[63,67] mgmt interfaces seem to be flapping from time to time
Hi folks!

I have recently seen the mgmt interfaces of analytics10[63,67] flapping in Icinga; sometimes connectivity is lost for ~1h. Could you please check whether faulty cables are attached to them?

@elukey The iDRACs on analytics1063 and analytics1067 are stuck, and each server needs to be physically powered off and unplugged for 20-30 seconds.

@BTullis can you coordinate with @Cmjohnson to shut down these nodes?

Yes, I'm happy to shut down these nodes whenever @Cmjohnson prefers.

@BTullis Can you plan to shut these down tomorrow, 17 March, at 10 a.m. EDT (14:00 UTC)?

Yes, will do. Both nodes at the same time?

Shutting down the two servers now: analytics1063 and analytics1067.

Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:12Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1063.eqiad.wmnet with reason: T303151

Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:16Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1063.eqiad.wmnet with reason: T303151

Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:20Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1067.eqiad.wmnet with reason: T303151

Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:23Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1067.eqiad.wmnet with reason: T303151

Mentioned in SAL (#wikimedia-analytics) [2022-03-17T14:07:33Z] <btullis> shutdown analytics1063 and analytics1067 with 120 minutes of downtime T303151

I am able to get into the iDRAC on both servers, although it takes a little longer than normal. I also tried replacing the cable, but that did not improve the link speed.

Thanks @Cmjohnson - I guess we'll just keep monitoring for stability and reopen this ticket if it keeps happening.