Thu, May 21
Confirmed that if dns4001 and dns4002 are down, ulsfo will stop advertising 22.214.171.124/24 to the world but will still have a route to 126.96.36.199/32 via codfw.
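Not our actual implementation, but to illustrate the failover logic above: a minimal ExaBGP-style health-check sketch that withdraws the anycast prefix once every local DNS host fails (hostnames, prefix, and port below are placeholders, not production values).
```
#!/usr/bin/env python3
# Illustration only: withdraw the anycast prefix when all local DNS hosts are down.
# Hostnames and the prefix are placeholders, not our production values.
import socket
import time

DNS_HOSTS = ["dns-a.example", "dns-b.example"]   # placeholder local resolvers
ANYCAST_PREFIX = "192.0.2.0/24"                  # placeholder advertised prefix

def host_up(host: str, port: int = 53, timeout: float = 2.0) -> bool:
    """TCP connect check; a real check would validate an actual DNS answer."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

advertised = False
while True:
    healthy = any(host_up(h) for h in DNS_HOSTS)
    if healthy and not advertised:
        # ExaBGP reads announce/withdraw commands from this process's stdout.
        print(f"announce route {ANYCAST_PREFIX} next-hop self", flush=True)
        advertised = True
    elif not healthy and advertised:
        print(f"withdraw route {ANYCAST_PREFIX} next-hop self", flush=True)
        advertised = False
    time.sleep(10)
```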
Good idea! What's the limit?
- Comcast - large US ISP - https://atlas.ripe.net/probes/6080/ - https://atlas.ripe.net/probes/6072/
- RIPE to have something to compare esams with - https://atlas.ripe.net/probes/6307/
- LACNIC to have something in South America - https://atlas.ripe.net/probes/6054/
- And on the other coast - https://atlas.ripe.net/probes/6554/
- This one sponsored by APNIC, located in Singapore - https://atlas.ripe.net/probes/6096/
- https://atlas.ripe.net/probes/6380/ and https://atlas.ripe.net/probes/6358/ to have something on both sides of Africa
We could add more regions depending on the "granularity" we want; a quick status check for these probes is sketched below.
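For reference, a quick way to check the status of the candidate probes listed above via the public RIPE Atlas API (the v2 endpoint is standard; the exact field names are my assumption and may need adjusting).
```
#!/usr/bin/env python3
# Quick status check of the candidate RIPE Atlas probes listed above.
# Uses the public API at https://atlas.ripe.net/api/v2/probes/<id>/ .
import json
import urllib.request

PROBES = [6080, 6072, 6307, 6054, 6554, 6096, 6380, 6358]

for probe_id in PROBES:
    url = f"https://atlas.ripe.net/api/v2/probes/{probe_id}/"
    with urllib.request.urlopen(url, timeout=10) as resp:
        probe = json.load(resp)
    # "country_code" and "status" are the field names I expect; adjust if the API differs.
    print(probe_id, probe.get("country_code"), probe.get("status", {}).get("name"))
```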
Wed, May 20
Tue, May 19
Relevant Turnilo where lots of things happened in a short timeframe:
Discussed it with John: 57.15.185.in-addr.arpa is configured to have ns1/2/3.wikimedia.org as NS, which is correct.
Thanks! This will also help in case the wrong cable gets bumped into during the new link provisioning.
I looked at the last actions I did yesterday and at the POP server links, and can't see anything missing, thanks!
Sounds good! This will have to wait until we do something like T196487, and outside of COVID times, as it's impactful and not urgent.
Mon, May 18
domain:   56.15.185.in-addr.arpa
descr:    Wikimedia_cloud_eqiad
admin-c:  FAID1-RIPE
admin-c:  MBE96-RIPE
tech-c:   FAID1-RIPE
tech-c:   MBE96-RIPE
tech-c:   AY3199-RIPE
zone-c:   WMF-RIPE
nserver:  ns0.openstack.eqiad1.wikimediacloud.org
nserver:  ns1.openstack.eqiad1.wikimediacloud.org
mnt-by:   WIKIMEDIA-MNT
source:   RIPE
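A quick way to double-check the delegations discussed above from a resolver's point of view (sketch using dnspython, which would need to be installed).
```
#!/usr/bin/env python3
# Verify the NS records actually served for the two reverse zones above.
# Requires dnspython >= 2 (older versions use dns.resolver.query instead).
import dns.resolver

for zone in ("56.15.185.in-addr.arpa", "57.15.185.in-addr.arpa"):
    answers = dns.resolver.resolve(zone, "NS")
    print(zone, "->", sorted(ns.target.to_text() for ns in answers))
```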
- Routinator was upgraded in T252010, which helped remove the "dubious" targets.
- Since this task was opened, proxies have been moved to new hosts and performance has increased.
- Alerting has been tuned to only trigger on HTTP codes > 399; as we can't control the repositories we connect to, there will always be a risk of alerts.
Not sure if I'm re-opening the proper task, but looks relevant.
For the record:
WARNING - (for 2d 15h 51m 27s) - Status Information: WARN: Long running SCREEN process. (user: root PID: 13601, 1089352s > 864000s).
1089352s ≈ 12.6 days (vs. the 864000s / 10-day threshold).
I'd not ping someone about a tmux running for a few hours.
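For context, the check appears to just compare process age against a fixed threshold; rough numbers, taken from the alert above:
```
# Rough math behind the alert above: process age vs. the plugin's threshold.
age_s = 1089352          # SCREEN process age from the alert
threshold_s = 864000     # plugin threshold

print(age_s / 86400)        # ~12.6 days running
print(threshold_s / 86400)  # 10.0 days before it warns
```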
FYI, Prometheus is trying to query netbox2001.wikimedia.org:8443 but there is nothing listening on that port, which is causing this alert:
Prometheus jobs reduced availability - job=netbox_device_statistics site=codfw
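To confirm from the Prometheus side, querying the up metric for that job shows which targets are failing; a sketch against the standard Prometheus HTTP API (the Prometheus hostname below is a placeholder).
```
#!/usr/bin/env python3
# Ask Prometheus which netbox_device_statistics targets are down (up == 0).
# The Prometheus URL is a placeholder; /api/v1/query is the standard endpoint.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example:9090"  # placeholder, not the real host
query = 'up{job="netbox_device_statistics"} == 0'

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)["data"]["result"]

for series in result:
    print("down:", series["metric"].get("instance"))
```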
Physically moving the optic to a different port solved the issue.
Opened T252988 to troubleshoot that specific issue.
Fri, May 15
Your service was affected by an outage along the transmission path, but the Loss of Signal we saw in Chicago happened after that outage had already started so it is unrelated.
Regarding the Loss of Sync alarm, it is something we see in our Chicago equipment:
This alarm is generated when a Loss Of Sync is detected from the client signal.
This alarm is most likely caused by:
- A physically severed fiber between the Trib port and the client equipment
- A physically severed fiber between the local network element and the upstream network element
- A faulty transmitter in the client equipment
When previously tested and again re-tested right now, placing a soft loop in Chicago facing the line, the traffic from San Francisco makes it to Chicago and then loops back, so we start transmitting back to you in San Francisco, so we know the span is good from San Francisco to Chicago. In Chicago, we had previously dispatched our equipment to hard loop test and replace our optic just in case, and I believe this was after you had already replaced your optic. Since that test was also passing, the next step in isolating the issue is if you can try a different port on your equipment, as well as verify all cabling.
If they're dead:
- Either we need them (e.g. short on ports), and in that case we need to replace the switch, which is a heavy operation.
- Or we mark the ports as dead (with a mention of that task), disable them, and call it a day (see the sketch below).
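If we go the second route, something along these lines with PyEZ would disable and label the ports (switch name, interface list, and description text below are placeholders).
```
#!/usr/bin/env python3
# Sketch: disable a dead port and describe it so nobody re-enables it by accident.
# Switch, interfaces, and task reference are placeholders.
from jnpr.junos import Device
from jnpr.junos.utils.config import Config

DEAD_PORTS = ["ge-1/0/47"]  # placeholder interface list

with Device(host="asw-example.mgmt", user="automation") as dev:
    with Config(dev, mode="exclusive") as cu:
        for port in DEAD_PORTS:
            cu.load(f"set interfaces {port} disable", format="set")
            cu.load(f'set interfaces {port} description "DEAD port - see task"', format="set")
        cu.commit(comment="Disable dead switch ports")
```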
From Telia, after asking them for the light levels they're getting:
Looks like we are still at times seeing low light and errors in Chicago and transmitting those to San Francisco:
- CHI: Rx -3.25 dBm, Tx -3.45 dBm
- Rx -55.00 dBm @ 11:00 to 11:15 UTC
- Rx -55.00 dBm @ 2:15 to 2:30 UTC
- Rx PCS: ES (errored seconds) for the last week from customer
- San Jose: Rx -3.26 dBm, Tx -55.00 dBm
- Tx -55.00 dBm for the last week, same as the errors in CHI
- Tx errors for last week
Was it on the Chicago side that you changed optics, and can you try a different port there?
Thu, May 14
Unplugging that link caused fpc1 to lose connectivity to the rest of the VC, even though it's neither a VCP nor enabled.
asw2-d-eqiad fpc1 PFEMAN: Shutting down in 5 seconds, PFEMAN Resync aborted! No peer info on reconnect or master rebooted?
asw2-d-eqiad fpc1 CMLC: Going disconnected; Routing engine chassis socket closed abruptly
Disabled the last link, and the errors are still showing up, so I'm confused on where the issue is coming from.
pic-slot 1 port 3 member 1 was a leftover port configured as a VC port, but with no cable connected to it.
Errors are still happening.
I first disabled the mentioned link on the fpc2 side (so we don't risk fully losing access to fpc1).
Then on the fpc1 side to check if the alert was caused by this DAC.
Wed, May 13
Remote hands replaced the optics yesterday but the link is still down. Lights are correct.
Tue, May 12
I started to look into that:
Yep, see diagram (minus the typo).
Fix is now running in prod.
Grafana alerts have been updated accordingly.
I'd say yes. 1/ and 2/ are done.
VictorOps seems to be a good replacement for the [stretch], as it's possible to page people directly even if the infra is down.
Mon, May 11
The only downside to fully removing the link is that D1 is 3 hops away from D8, which doesn't seem to have been an issue since May 2nd.
Upside is that it brings us closer to a proper cabling diagram.
As 1001 and 2002 are gone, this task might be good to close?
Fri, May 8
After ticket 01157098 was resolved, the link didn't come back up.
Ticket 01157707 was opened.
Telia set up a loop on the Chicago side towards SF, which brought the SF interface up, but the Chicago-facing loop didn't bring the interface up.
ACKed for 6 more hours while Telia fixes it.
Added routinator_rtr_current_connections to the Grafana dashboard.
Thu, May 7
Wed, May 6
So far so good, will let it sit until tomorrow before tackling rpki1001.
Note that we only have netflow at our borders, and we sample at 1:1000, so it might not be the right tool for now (rough numbers below).
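Back-of-the-envelope illustration of why 1:1000 sampling is limiting here (not a measurement, just the arithmetic):
```
# At 1:1000 sampling, a flow needs on the order of a thousand packets
# before we can expect to see even one sample of it.
sampling_rate = 1 / 1000
flow_packets = 200          # e.g. a short-lived flow
expected_samples = flow_packets * sampling_rate
print(expected_samples)     # 0.2 -> most short flows never show up at all
```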
Tue, May 5
Cabling diagram, let me know if something is missing or unclear:
Mon, May 4
Same for https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=phab.wmfusercontent.org&service=HTTPS-wmfusercontent
Sat, May 2
Fri, May 1
Thu, Apr 30
Stalling the task until either:
- we can start doing more intrusive testing to see if it works as expected, or
- msw1-eqiad is replaced (T225121)
Thanks. Manual action is better here to prevent flapping.
Wed, Apr 29
Tue, Apr 28
Yes, both PtMP VPLS (displayed as 3 links from site X to provider, and not site X to site Y) and GRE tunnels between sites.
- What do you envision the difference to be between "primary" and "preferred"? (I know you said TBD, but curious :)
TBD, but this is to reflect our current logic as shown in the diagram.
Primary would be the default state; Preferred would be an override to drain alternate links.