
analytics10[63,67] mgmt interfaces seem to be flapping from time to time
Closed, Resolved, Public

Description

Hi folks!

I have seen the mgmt interfaces of analytics10[63,67] flapping in Icinga recently; sometimes connectivity is lost for ~1h. Could you please check whether faulty cables are attached to them?

Event Timeline

@elukey The iDRACs on analytics1063 and analytics1067 are stuck, and each server needs to be physically powered off and unplugged for 20-30 seconds.

@BTullis can you coordinate with @Cmjohnson to shut down these nodes?

Yes, I'm happy to shut down these nodes whenever @Cmjohnson prefers.

@BTullis Can you plan to shut these down tomorrow, 17 March, at 10 a.m. Eastern (14:00 UTC)?

Yes, will do. Both nodes at the same time?

Shutting down the two servers now: analytics1063 and analytics1067.

Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:12Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1063.eqiad.wmnet with reason: T303151

Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:16Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1063.eqiad.wmnet with reason: T303151

Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:20Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1067.eqiad.wmnet with reason: T303151

Mentioned in SAL (#wikimedia-operations) [2022-03-17T14:05:23Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1067.eqiad.wmnet with reason: T303151

Mentioned in SAL (#wikimedia-analytics) [2022-03-17T14:07:33Z] <btullis> shutdown analytics1063 and analytics1067 with 120 minutes of downtime T303151

BTullis triaged this task as Medium priority. Mar 17 2022, 2:36 PM
BTullis moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

I am able to get into the iDRAC on both servers, although it takes a little longer than normal. I also tried replacing the cable, but that did not improve the link speed.

Thanks @Cmjohnson - I guess we'll just keep monitoring for stability and reopen this ticket if it keeps happening.