WMF RIPE Atlas probe in Eqsin offline
Closed, Resolved, Public

Description

It seems our RIPE Atlas probe in eqsin has been offline since 01:18 UTC on Dec 17th.

This is similar to the one in eqiad (see T382518), though I believe that is just a coincidence. The symptoms are the same: the port is up, but the device is unresponsive and we don't see a MAC address on the port:

cmooney@asw1-eqsin> show interfaces descriptions | match atlas             
ge-1/0/22       up    up   atlas-eqsin {#1049}

{master:0}
cmooney@asw1-eqsin> show ethernet-switching table interface ge-1/0/22.0    

MAC database for interface ge-1/0/22.0

{master:0}
cmooney@asw1-eqsin>
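
The "no MAC address on the port" check above can be automated against pasted switch output. The following is only a minimal sketch: the function name `learned_macs` and the `EMPTY_TABLE` sample are hypothetical, and it assumes MAC addresses appear in colon-separated form in the output.

```python
import re

# Matches a colon-separated MAC address such as 00:11:22:33:44:55.
# Assumption: the switch prints MACs in this form.
MAC_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\b")


def learned_macs(cli_output: str) -> list[str]:
    """Return every MAC address found in the pasted switch output, lowercased."""
    return [mac.lower() for mac in MAC_RE.findall(cli_output)]


# Sample copied from the empty table shown in this task.
EMPTY_TABLE = """
MAC database for interface ge-1/0/22.0

{master:0}
"""

if __name__ == "__main__":
    macs = learned_macs(EMPTY_TABLE)
    print("no MAC learned" if not macs else f"learned: {macs}")
```

An empty result together with an "up/up" port matches the symptom described here: link is present but the device is not sending any traffic.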

I bounced the port to no avail. I'm not sure what the best way forward is, but a power cycle should be the first step if we want to schedule that with remote hands. The device is in rack 604, U36.

Event Timeline

cmooney triaged this task as Low priority.

Icinga downtime and Alertmanager silence (ID=68d77968-a0dd-4bd1-94ad-66be8ab508c5) set by cmooney@cumin1002 for 30 days, 0:00:00 on 2 host(s) and their services with reason: Atlas device offline, scheduling reboot

ripe-atlas-eqsin,ripe-atlas-eqsin IPv6

Let's decom it and focus our efforts on spinning up VMs instead (T385560).
It needs to be removed from the list on https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common.yaml#L1881 as well as from RIPE's dashboard. Then we shut down the switch port, reclaim the IP, and hand the device over to DCops to be recycled during the next DC visit. There is no confidential data on the box, so no wipe is needed.

Change #1117154 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Remove eqiad and eqsin ripe atlas from monitoring

https://gerrit.wikimedia.org/r/1117154

Change #1117154 merged by Ayounsi:

[operations/puppet@production] Remove eqiad and eqsin ripe atlas from monitoring

https://gerrit.wikimedia.org/r/1117154

The physical anchor has been replaced by a VM, so I'm moving this task to DCops to recycle the failed hardware: https://netbox.wikimedia.org/dcim/devices/1287/

Arzhel,

Has the anchor been wiped of any sensitive data so it can simply be unplugged and thrown away or recycled?

The anchor doesn't contain any sensitive data, so yep it can be unplugged and recycled anytime.

Awesome. It likely isn't worth a standalone ticket for removal and disposal, so instead I'm going to keep this open until I do the following:

  • update notes of object in netbox with link to this task
  • login to the PDUs and hard set the power outlet for the anchor to off

The device will then sit and wait until we have a hardware recycling round in eqsin (likely not until we order new hardware or need the space), but this task will be resolved once I complete the two steps above.

notes for device now include: Anchor offline and power port powered down per T382519. No sensitive data on device, can be disposed of in next recycling.

set device to decommissioning in netbox (offline cannot have rack defined)

While it still has the power cord and network cord plugged in, they may as well stay in netbox. With the host set to 'decom' it shouldn't generate errors. If it does, I'll adjust how we're keeping things in netbox and will have to follow up with remote hands to remove the cables.

Once we have an actual round of recycling, this will be included.
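
The two Netbox changes noted above (the device note and the decommissioning status) could also be expressed as a single REST call. This is only a sketch under stated assumptions: it assumes the standard NetBox REST API is reachable at the same host as the device link in this task, that the note lives in the device's `comments` field, and that token authentication is used. The function name `build_decom_request` and the token value are hypothetical, and the request is built but never sent.

```python
import json
import urllib.request

# Assumptions: base URL and device ID are taken from the device link in this
# task; whether the API is exposed there is not confirmed by the task itself.
NETBOX_URL = "https://netbox.wikimedia.org"
DEVICE_ID = 1287


def build_decom_request(token: str, note: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH marking the device as decommissioning."""
    # Assumption: the "notes" mentioned in the task map to the "comments" field.
    payload = {"status": "decommissioning", "comments": note}
    return urllib.request.Request(
        url=f"{NETBOX_URL}/api/dcim/devices/{DEVICE_ID}/",
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    note = (
        "Anchor offline and power port powered down per T382519. "
        "No sensitive data on device, can be disposed of in next recycling."
    )
    req = build_decom_request("PLACEHOLDER_TOKEN", note)
    # Print the request for review instead of sending it.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Printing the request rather than sending it keeps the sketch safe to run; actually submitting it would require passing `req` to `urllib.request.urlopen` with a valid token.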

disabled the single power plug on the PDU tower 2 directly via pdu mgmt.