
ripe-atlas-eqiad IPv6 unreachable
Closed, Resolved · Public

Description

See https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=ripe-atlas-eqiad+IPv6

Started alerting 18h ago. Possibly after an attempt by the RIPE to remotely upgrade it.

.cr1-eqiad> ...:80:155:69 source 2620:0:861:202::1                 
PING6(56=40+8+8 bytes) 2620:0:861:202::1 --> 2620:0:861:202:208:80:155:69
^C
--- 2620:0:861:202:208:80:155:69 ping6 statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
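For scripted checks, the loss figure in a summary line like the one above can be pulled out with a small filter. This is only a sketch: `packet_loss` is a hypothetical helper name, and it assumes the summary contains the usual "N% packet loss" wording.

```shell
# Hypothetical helper: read ping/ping6 output on stdin and print the
# packet-loss percentage found in the summary line (without the % sign).
packet_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example use on a host with a ping6 binary (flags vary by platform):
#   ping6 -c 4 2620:0:861:202:208:80:155:69 | packet_loss
```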

Related Objects

Status: Resolved · Assigned: CDanis
Status: Resolved · Assigned: Cmjohnson

Event Timeline

ayounsi triaged this task as Medium priority. Jul 15 2020, 6:14 AM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper. Jul 15 2020, 6:14 AM

I have heard from the RIPE NCC: they are going to attempt to upgrade our eqiad anchor in place, so it may be down for a few days.

There are also some diffscan changes, to be checked once everything is done.

CDanis claimed this task. Jul 16 2020, 3:45 PM
CDanis added a subscriber: CDanis.

To give a little more context: in response to our request for an extension for the v2 anchors, the RIPE NCC team reached out to ask whether they could run a test upgrade on one of our anchors (which I of course said OK to!).

It seems that while the upgrade was successful, the eqiad anchor did not survive a reboot. They're asking for help with console access, so I looped in @CDanis to help them out with that, pulling in other team members (DC Ops for the scs connection, etc.) as necessary. Thanks all :)

According to https://radar.qrator.net, portmap is open to the world, but I was not able to reproduce that.

Jclark-ctr added a subscriber: Jclark-ctr.

Connected the console port to scs-c1-eqiad and updated Netbox with the connection.

CDanis closed this task as Resolved. Jul 28 2020, 1:10 PM

With the serial console now attached, I found myself in a rescue shell.

I poked around some, got / and /boot mounted under the empty /sysroot, looked at the failed kexec invocation in /proc/cmdline, and modified it to properly specify a root device for the kernel (previously it just listed /dev/mapper/vg01-lv_root instead of root=/dev/mapper/vg01-lv_root):

kexec -l /sysroot/boot/vmlinuz-3.10.0-1127.13.1.el7.x86_64 --initrd=/sysroot/boot/initramfs-3.10.0-1127.13.1.el7.x86_64.img --append="root=/dev/mapper/vg01-lv_root ro crashkernel=auto rd.lvm.lv=vg01/lv_root biosdevname=0 net.ifnames=0 console=ttyS0,19200n8 LANG=en_US.UTF-8 clocksource=tsc fsck.mode=force fsck.repair=yes"
kexec -e
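The failure mode described above (a bare device path where `root=<device>` was expected) is easy to check for before rebooting. The following is a minimal sketch, not what was run on the anchor: `has_root_param` is a hypothetical helper that only tests whether a command-line string carries a `root=` parameter.

```shell
# Hypothetical helper: succeed (exit 0) if the kernel command line passed
# as the first argument contains a root= parameter, fail (exit 1) otherwise.
has_root_param() {
    case " $1 " in
        *" root="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use against the running kernel's command line:
#   has_root_param "$(cat /proc/cmdline)" || echo "no root= parameter found"
```

Padding the string with spaces lets the pattern match `root=` only at the start of a parameter, so values such as `rd.lvm.lv=vg01/lv_root` do not count as a match.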

Now the system is back up, showing a login prompt and responding to IPv4 and IPv6 pings again, as well as other traffic:

% curl http://ripe-atlas-eqiad.wikimedia.org/256
{"anchor":"us-qas-as14907.anchors.atlas.ripe.net","client":"xxx.x.xxx.xx","payload":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"}

It doesn't yet show as back online on RIPE's site, but I suspect that's just eventual consistency at work.