WMF RIPE Atlas probe in Eqiad offline
Closed, Resolved (Public)

Description

@tappof made me aware that our RIPE Atlas probe in Eqiad has been unreachable since 14:19 UTC on Dec 16th.

It's not responding to pings at all. On the switch it's connected to, the port shows as up, but we are not learning any MAC address on it:

cmooney@asw2-b-eqiad> show interfaces descriptions | match atlas           
ge-4/0/37       up    up   atlas-eqiad

{master:2}
cmooney@asw2-b-eqiad> show ethernet-switching table interface ge-4/0/37.0  

MAC database for interface ge-4/0/37.0

{master:2}
cmooney@asw2-b-eqiad>

I bounced the port remotely to see if it'd kick it into life but still nothing.

DC-Ops folks, whenever someone is next on site, could they reboot this one? Just power it off and on again; it's in rack B4, U45.

Event Timeline

cmooney triaged this task as Low priority.

Icinga downtime and Alertmanager silence (ID=7fe2fd80-b4a4-43f7-ba5a-5238c44bbd7a) set by cmooney@cumin1002 for 30 days, 0:00:00 on 2 host(s) and their services with reason: Atlas device offline, scheduling reboot

ripe-atlas-eqiad,ripe-atlas-eqiad IPv6
VRiley-WMF changed the task status from Open to In Progress. Jan 13 2025, 9:17 PM
VRiley-WMF subscribed.

Rebooting Now

This has been rebooted

@cmooney would you be able to check this when you have a chance?

Thanks for doing that, Valerie. Unfortunately the port is hard down after the reboot, so I'm guessing the device is dead.

cmooney@asw2-b-eqiad> show interfaces descriptions | match atlas 
ge-4/0/37       up    down atlas-eqiad
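
For reference, the two switch states seen in this task (link up with an empty MAC table before the reboot, link down after it) can be told apart mechanically. Below is a minimal Python sketch, assuming the row layout of `show interfaces descriptions` shown above. The names `PortState`, `parse_description_row` and `diagnose` are hypothetical, the diagnosis strings are only one reading of each state, and none of this is existing WMF tooling.

```python
import re
from typing import NamedTuple, Optional


class PortState(NamedTuple):
    """One row of `show interfaces descriptions` (hypothetical helper type)."""
    name: str
    admin: str
    link: str
    description: str


# Matches a row such as "ge-4/0/37       up    up   atlas-eqiad".
# Assumption: columns are interface, admin state, link state, description.
_ROW = re.compile(r"^(\S+)\s+(up|down)\s+(up|down)\s+(.*\S)\s*$")


def parse_description_row(line: str) -> Optional[PortState]:
    """Return the parsed row, or None for prompts, banners and blank lines."""
    match = _ROW.match(line.strip())
    return PortState(*match.groups()) if match else None


def diagnose(port: PortState, learned_macs: int) -> str:
    """Map port state plus the number of learned MACs to a rough diagnosis."""
    if port.admin == "down":
        return "port administratively disabled"
    if port.link == "down":
        return "no link: device powered off, dead, or cable fault"
    if learned_macs == 0:
        return "link up but no MAC learned: device is not sending frames"
    return "link up and MAC learned: look at layer 3 instead"


if __name__ == "__main__":
    before = parse_description_row("ge-4/0/37       up    up   atlas-eqiad")
    after = parse_description_row("ge-4/0/37       up    down atlas-eqiad")
    print(diagnose(before, learned_macs=0))
    print(diagnose(after, learned_macs=0))
```

The MAC count is passed in as a plain integer rather than parsed, since an empty `show ethernet-switching table` output like the one in the description carries no rows to match.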

We'll need to make a call on how we move forward (whether to replace it with another physical unit or a virtualized one). I'll remove the DC-ops tags and have a think; it may be a little while before I have time to work on it.

Let's decom it and focus our efforts on spinning up VMs instead (T385560).
It needs to be removed from the list at https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common.yaml#L1881 as well as from RIPE's dashboard. Then shut down the switch port, reclaim the IP, and hand the box over to DCops to be recycled. There is no confidential data on the box, so no wipe is needed.

Change #1117154 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Remove eqiad and eqsin ripe atlas from monitoring

https://gerrit.wikimedia.org/r/1117154

Change #1117154 merged by Ayounsi:

[operations/puppet@production] Remove eqiad and eqsin ripe atlas from monitoring

https://gerrit.wikimedia.org/r/1117154

VRiley-WMF claimed this task.

This has been removed and completed