ripe-atlas-codfw is down
Closed, Resolved (Public)

Description

Hi everybody,

the ripe-atlas-codfw anchor has been down since 2020-11-10 at around 21:00 UTC:

https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-target_site=All&var-ip_version=All&var-country_code=All&var-asn=All&from=1605026694423&to=1605056765545

I can see the following on asw-a-codfw:

Nov 10 20:54:44  asw-a-codfw fpc1 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 18 DOWN
Nov 10 20:54:44  asw-a-codfw rpd[1947]: EVENT <UpDown> ge-1/0/4.0 index 587 <Broadcast Multicast> address #0 dc.38.e1.d4.1b.7
Nov 10 20:54:44  asw-a-codfw rpd[1947]: EVENT <UpDown> ge-1/0/4 index 1125 <Broadcast Multicast> address #0 dc.38.e1.d4.1b.7
Nov 10 20:54:44  asw-a-codfw rpd[1947]: STP handler: IFD =NULL, op=change, state=Discarding, Topo change generation=0
Nov 10 20:54:44  asw-a-codfw rpd[1947]: *STP Change*, notify to other modules
Nov 10 20:54:44  asw-a-codfw fpc1 [EX-BCM PIC] ex_bcm_pic_ifd_config: ge-1/0/4, enable - 1
Nov 10 20:54:44  asw-a-codfw mib2d[15883]: SNMP_TRAP_LINK_DOWN: ifIndex 757, ifAdminStatus up(1), ifOperStatus down(2), ifName ge-1/0/4
Nov 10 20:54:44  asw-a-codfw rpd[1947]: STP handler: IFD =NULL, op=change, state=Discarding, Topo change generation=0
Nov 10 20:54:44  asw-a-codfw rpd[1947]: *STP Change*, notify to other modules

As far as I can see from other tasks, this will probably require @Papaul to check onsite (powercycle, cables, etc.).

Event Timeline

Restricted Application added a subscriber: Aklapper.
elukey triaged this task as High priority. Nov 11 2020, 7:40 AM

Power cycled the device, checked the cable, and swapped the cable; the device is still showing down.

ayounsi renamed this task from ripe-atlast-codfw is down to ripe-atlas-codfw is down. Nov 16 2020, 4:12 PM
ayounsi assigned this task to faidon.
ayounsi subscribed.

I think Faidon is the person who knows the most about the Atlas :)
Feel free to re-assign as needed.

CDanis added subscribers: faidon, CDanis.

Papaul, could you please attach atlas-codfw to one of the SCS servers so we can take a look via serial console? Thanks!

Connected the device to scs-a1 on port 47; still no connection over serial.

Thanks! Could you try powercycling it while one of us (either you or myself, at your preference) is watching the serial console?

Today we tried powercycling the anchor while I was watching on serial console. It didn't output a thing. As far as I can tell, we need replacement hardware.

Thanks - can you file a procurement request to that effect (& then resolve this task)?

CDanis mentioned this in Unknown Object (Task).

Filed T269046

Can we have a decom task for the faulty device? (switch port is still alerting as being down)

@faidon do we have some documentation on the console configuration for the RIPE?

  • console baud rate
  • Type of cable to use to connect to the console

I tried a Cisco console cable with a DB9 to RJ45 adapter, but had no luck on the new RIPE.

I'm pretty sure the baud rate is 19200

Not sure about the cable type

I tried both 9600 and 19200 on both cables; neither worked.

I believe the Atlas is a PCEngines APU, so you'll need a null modem cable or adapter (RXD->TXD, TXD->RXD, etc.). If this is a Cisco rollover cable, it would do the trick, but your DB9<->RJ45 adapter should not be a crossover adapter as well, since that would cross the lines twice end-to-end and the two crossovers would cancel each other out :)

Baud rate for the BIOS as the system boots is 115200 8n1. Note that unlike our Dells, its BIOS takes all of 2 seconds to boot or something.

I don't know the specifics of the Atlas - did it come with software preinstalled? I'd guess not, and that we'll need to flash it, right? In that case nothing would show up on the console past boot/BIOS.
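
To illustrate why two crossovers cancel out, here is a minimal Python sketch. It assumes a deliberately simplified model in which each cable or adapter is just a mapping from the signal on one end to the signal on the other; the names NULL_MODEM, STRAIGHT, chain and is_crossed are made up for this illustration and do not describe the actual pinout of the APU or of any specific cable.

```python
# Simplified model: a segment maps the signal entering one end
# to the signal it is wired to on the other end.
NULL_MODEM = {"TXD": "RXD", "RXD": "TXD", "GND": "GND"}  # crosses TXD/RXD
STRAIGHT = {"TXD": "TXD", "RXD": "RXD", "GND": "GND"}    # passes through


def chain(*segments):
    """Compose segments end-to-end and return the overall signal mapping."""
    result = {signal: signal for signal in STRAIGHT}
    for segment in segments:
        result = {src: segment[dst] for src, dst in result.items()}
    return result


def is_crossed(mapping):
    """True if TXD on one end reaches RXD on the other, and vice versa."""
    return mapping["TXD"] == "RXD" and mapping["RXD"] == "TXD"


# One crossing segment plus a straight-through adapter: still crossed.
print(is_crossed(chain(NULL_MODEM, STRAIGHT)))
# Two crossing segments in series: the crossings cancel out.
print(is_crossed(chain(NULL_MODEM, NULL_MODEM)))
```

In this model, a single crossing anywhere in the chain gives the null modem behaviour, while a second crossing brings the wiring back to straight-through.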

Thank you for the information. I will try to work on it again when I am back on site tomorrow.

@CDanis
The old device is already set to decom in netbox. Let me know when the new device is online so I can offline this device.

Papaul lowered the priority of this task from High to Lowest. Mar 4 2021, 4:18 PM

Unfortunately I won't have time to work on this before going on leave, but it seems like it might not be a bad task for @cmooney to learn some about RIPE Atlas. Or failing that perhaps @ayounsi will have some time for it.

joanna_borun changed the task status from Open to In Progress. Sep 21 2021, 4:01 PM

Hello folks! Not sure if this is already scheduled, but the current icinga checks for the codfw ripe atlas are getting a 410 Gone. Do we need to update the ripeatlas_measurements values in hiera? (No idea where to get the values.)

The current error is:

UNKNOWN - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with urllib.error.HTTPError: HTTP Error 410: Gone
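
For context, here is a minimal sketch of how urllib surfaces this kind of failure. It assumes nothing about the internals of check_ripe_atlas.py: the helper classify_http_error, the example URL and the message wording are all hypothetical. What is standard is that urllib raises urllib.error.HTTPError for a 410 response and that the exception exposes the status in its code attribute.

```python
import urllib.error


def classify_http_error(err):
    """Turn an HTTPError into a one-line status message (hypothetical helper)."""
    if err.code == 410:
        # 410 Gone: the requested resource no longer exists, so the
        # configured identifier probably needs to be replaced.
        return "UNKNOWN - resource is gone (HTTP 410); check the configured measurement IDs"
    return f"UNKNOWN - HTTP error {err.code}: {err.reason}"


# Build the exception by hand so the sketch runs without network access.
# The URL is a placeholder, not a real RIPE Atlas endpoint.
err = urllib.error.HTTPError("https://example.invalid/measurement", 410, "Gone", None, None)
print(err)
print(classify_http_error(err))
```

Catching the 410 explicitly would let a check say that its configuration is stale instead of failing with a bare traceback.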

Change 732252 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Update ripeatlas_measurements for codfw

https://gerrit.wikimedia.org/r/732252

Change 732252 merged by Ayounsi:

[operations/puppet@production] Update ripeatlas_measurements for codfw

https://gerrit.wikimedia.org/r/732252

Thanks, fixed.
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ripe-atlas-codfw&service=IPv4+ping+to+codfw
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ripe-atlas-codfw+IPv6&service=IPv6+ping+to+codfw

I think everything is done here, I'll let @cmooney close the task if so as he did all the hard work!

Cool, thanks @ayounsi. Good insight into how those alerts are configured. I'll know for the next time to update them too :)