[x] - Provide FQDN of system: cable between cloudsw1-d5-eqiad port et-0/0/53 and cloudsw1-f4-eqiad port et-0/0/54
[x] - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide a time frame for us to take the machine down.
[x] - Put system into a failed state in Netbox.
[x] - Provide urgency of request, along with justification (redundancy, dependencies, etc)
[] - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
[] - Assign correct project tag and appropriate owner (based on above). Please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Starting in the past 24 hours, and particularly over the last 12 hours, we have seen incrementing errors on the core WMCS link between racks D5 and F4, presenting as inbound errors on //cloudsw1-d5-eqiad et-0/0/53//:

{F55242539}

That port is connected to //cloudsw1-f4-eqiad port et-0/0/54// over single-mode fiber. As usual with these things, the likely culprit is a bad optic module at one end or the other, but it is difficult to say which. The modules are 40G-BaseLR4 (blue handle, I think).

We have redundant network paths, so this is not an ongoing outage, but it does reduce our redundancy on this link to zero: any issue with the other network path would bring all Cloud Services down until it was repaired. High priority, though not top priority.

DC-Ops: when possible, can we try replacing the module in //cloudsw1-d5-eqiad et-0/0/53// and see if there is any improvement? The link has been drained, so we can do this at any time. If that does not help, we can instead try swapping the far side in F4.
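Not part of the original ticket, but for reference, these are the sort of JunOS operational commands typically used to watch counters like these (interface names taken from the ticket; exact output varies by platform and Junos version):

```
# Inbound error counters on the D5 side (look for incrementing input/framing/FEC errors)
show interfaces et-0/0/53 extensive | match error

# Optic diagnostics: rx/tx power and alarms (low rx power often points
# at a failing far-end optic or a dirty/damaged fiber)
show interfaces diagnostics optics et-0/0/53

# Clear counters before a test so any new errors are unambiguous
clear interfaces statistics et-0/0/53
```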
Ping me on IRC when available and we can run some tests to see how it looks. Thanks.
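The post-swap check ("run some tests to see how it looks") amounts to confirming that the input-error counter stops incrementing once traffic is back on the link. A minimal illustrative sketch, with made-up counter values (on a real switch they would come from SNMP `ifInErrors` or the Junos CLI/API, neither of which is shown here):

```python
def error_rate(before: int, after: int, interval_s: float) -> float:
    """Errors per second between two readings of a monotonic error counter."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    # max() guards against a counter clear/reset between the two readings
    return max(after - before, 0) / interval_s

def still_failing(before: int, after: int, interval_s: float,
                  threshold: float = 0.0) -> bool:
    """True if the counter incremented faster than the allowed threshold."""
    return error_rate(before, after, interval_s) > threshold

# Hypothetical example: counter went 14,200 -> 14,560 over 5 minutes
print(error_rate(14_200, 14_560, 300))          # 1.2 errors/s -> optic still bad
print(still_failing(14_200, 14_560, 300))       # True
print(still_failing(500, 500, 300))             # False: counter flat after the swap
```

Sampling over a few minutes rather than comparing two instantaneous values avoids mistaking a single residual error for an ongoing fault.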