ayounsi (Arzhel Younsi)
Network Engineer

User Details

User Since
Apr 3 2017, 6:23 PM (164 w, 5 h)
Availability
Available
IRC Nick
xionox
LDAP User
Ayounsi
MediaWiki User
AYounsi (WMF) [ Global Accounts ]

Recent Activity

Thu, May 21

ayounsi closed T253196: Advertise 198.35.27.0/24 as anycast prefix, a subtask of T98006: Anycast AuthDNS, as Resolved.
Thu, May 21, 5:26 PM · Patch-For-Review, Performance-Team (Radar), netops, Operations, Traffic
ayounsi closed T253196: Advertise 198.35.27.0/24 as anycast prefix as Resolved.

Confirmed that if dns4001 and dns4002 are down, ulsfo stops advertising 198.35.27.0/24 to the world but still has routes to 198.35.27.27/32 via codfw.
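
To illustrate why the /32 stays reachable from ulsfo even when the covering /24 is withdrawn, here is a minimal longest-prefix-match sketch. The `best_route` helper and the two route lists are hypothetical and only model the behaviour described above; they are not taken from the router configuration.

```python
import ipaddress

def best_route(destination, routes):
    """Return the most specific (longest-prefix) route covering destination, or None."""
    addr = ipaddress.ip_address(destination)
    matching = [n for n in map(ipaddress.ip_network, routes) if addr in n]
    return max(matching, key=lambda n: n.prefixlen, default=None)

# Hypothetical tables: with local DNS hosts up, both prefixes are present;
# with dns4001 and dns4002 down, only the /32 learned via codfw remains.
routes_local_up = ["198.35.27.0/24", "198.35.27.27/32"]
routes_local_down = ["198.35.27.27/32"]

print(best_route("198.35.27.27", routes_local_up))
print(best_route("198.35.27.27", routes_local_down))
print(best_route("198.35.27.1", routes_local_down))
```

In both tables the service address matches the /32, while any other address in the /24 has no route once the /24 is withdrawn.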

Thu, May 21, 5:26 PM · Operations, Traffic, netops
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Thu, May 21, 5:24 PM · Operations, Traffic, netops
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Thu, May 21, 3:29 PM · Operations, Traffic, netops
ayounsi raised the priority of T243927: Export Netbox Stats for DCops to a visualization tool from Medium to High.
Thu, May 21, 10:03 AM · SRE-tools, netbox, DC-Ops, User-crusnov
ayounsi added a comment to T252890: scrape ripe atlas data for a few anchors at other large networks.

Good idea! What's the limit?
I'd suggest:

We could add more regions depending on the "granularity" we want

Thu, May 21, 9:43 AM · netops, Operations
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Thu, May 21, 9:26 AM · Operations, Traffic, netops
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Thu, May 21, 8:43 AM · Operations, Traffic, netops
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Thu, May 21, 8:12 AM · Operations, Traffic, netops
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Thu, May 21, 7:32 AM · Operations, Traffic, netops

Wed, May 20

ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Wed, May 20, 7:12 PM · Operations, Traffic, netops
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Wed, May 20, 6:37 PM · Operations, Traffic, netops
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Wed, May 20, 6:29 PM · Operations, Traffic, netops
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Wed, May 20, 11:18 AM · Operations, Traffic, netops
ayounsi updated the task description for T253196: Advertise 198.35.27.0/24 as anycast prefix.
Wed, May 20, 10:28 AM · Operations, Traffic, netops
ayounsi added a subtask for T98006: Anycast AuthDNS: T253196: Advertise 198.35.27.0/24 as anycast prefix.
Wed, May 20, 8:32 AM · Patch-For-Review, Performance-Team (Radar), netops, Operations, Traffic
ayounsi added a parent task for T253196: Advertise 198.35.27.0/24 as anycast prefix: T98006: Anycast AuthDNS.
Wed, May 20, 8:31 AM · Operations, Traffic, netops
ayounsi triaged T253196: Advertise 198.35.27.0/24 as anycast prefix as Medium priority.
Wed, May 20, 7:57 AM · Operations, Traffic, netops
ayounsi triaged T253194: Homer CI: verify Junos syntax as Low priority.
Wed, May 20, 6:48 AM · homer

Tue, May 19

ayounsi added a comment to T253128: intermittent brief data dropouts for esams netflow data.

Relevant Turnilo where lots of things happened in a short timeframe:

Tue, May 19, 7:28 PM · Operations, netops
ayounsi added a comment to T247972: Cloud DNS: fix inconsistent ownership of reverse domains for openstack floating ip networks.

Discussed it with John: 57.15.185.in-addr.arpa is configured with ns1/2/3.wikimedia.org as NS, which is correct.
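
For readers less familiar with reverse zones, a small sketch of how an in-addr.arpa zone name maps back to the network it covers. `reverse_zone_to_network` is a hypothetical helper that only handles zones delegated on octet boundaries (no RFC 2317 classless delegations).

```python
import ipaddress

def reverse_zone_to_network(zone):
    """Convert an IPv4 in-addr.arpa zone name to the network it covers."""
    suffix = ".in-addr.arpa"
    name = zone.rstrip(".").lower()
    if not name.endswith(suffix):
        raise ValueError(f"not an in-addr.arpa zone: {zone}")
    # Labels are stored least-significant octet first, so reverse them.
    octets = list(reversed(name[: -len(suffix)].split(".")))
    if not 1 <= len(octets) <= 4:
        raise ValueError(f"unexpected number of labels: {zone}")
    prefixlen = 8 * len(octets)
    octets += ["0"] * (4 - len(octets))
    return ipaddress.ip_network(".".join(octets) + f"/{prefixlen}")

print(reverse_zone_to_network("57.15.185.in-addr.arpa"))
print(reverse_zone_to_network("56.15.185.in-addr.arpa"))
```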

Tue, May 19, 4:28 PM · cloud-services-team (Kanban)
ayounsi closed T253122: Set minimum-links 2 to AMS-IX LACP as Resolved.

Thanks! This will also help in case the wrong cable gets bumped into during the new link provisioning.

Tue, May 19, 2:28 PM · Operations, netops
ayounsi updated subscribers of T253122: Set minimum-links 2 to AMS-IX LACP.
Tue, May 19, 2:11 PM · Operations, netops
ayounsi triaged T253122: Set minimum-links 2 to AMS-IX LACP as High priority.
Tue, May 19, 2:08 PM · Operations, netops
ayounsi closed T253091: Restore POPs server interfaces and cables as Resolved.

I reviewed my last actions from yesterday and the POP server links, and can't see anything missing, thanks!

Tue, May 19, 12:02 PM · netbox
ayounsi changed the status of T247881: Three ports on asw2-d-eqiad are not working as expected from Open to Stalled.

Sounds good! This will have to wait until we do, for example, T196487, outside of COVID times, as it's impactful and not urgent.

Tue, May 19, 7:29 AM · ops-eqiad, netops, Operations
ayounsi updated subscribers of T247972: Cloud DNS: fix inconsistent ownership of reverse domains for openstack floating ip networks.

I spent some time digging through the RIPE doc, but can't find any clear answer for T247972#6130041.
@jbond do you have any idea?

Tue, May 19, 7:24 AM · cloud-services-team (Kanban)
ayounsi triaged T253091: Restore POPs server interfaces and cables as High priority.
Tue, May 19, 6:28 AM · netbox

Mon, May 18

ayounsi added a comment to T247972: Cloud DNS: fix inconsistent ownership of reverse domains for openstack floating ip networks.
eqiad
domain:         56.15.185.in-addr.arpa
descr:          Wikimedia_cloud_eqiad
admin-c:        FAID1-RIPE
admin-c:        MBE96-RIPE
tech-c:         FAID1-RIPE
tech-c:         MBE96-RIPE
tech-c:         AY3199-RIPE
zone-c:         WMF-RIPE
nserver:        ns0.openstack.eqiad1.wikimediacloud.org
nserver:        ns1.openstack.eqiad1.wikimediacloud.org
mnt-by:         WIKIMEDIA-MNT
source:         RIPE
Mon, May 18, 5:16 PM · cloud-services-team (Kanban)
ayounsi closed T245121: RRDP status alert as Resolved.
  • Routinator was upgraded in T252010, which helped remove the "dubious" targets.
  • Since this task was opened, proxies have been moved to new hosts and performance has increased.
  • Alerting has been tuned to only trigger on HTTP codes > 399. As it's not possible to control the repositories we connect to, there will always be a risk of alerts.
Mon, May 18, 3:05 PM · Operations, netops
ayounsi reopened T224557: Migrate ldap/corp replicas to Stretch/Buster, a subtask of T224549: Track remaining jessie systems in production, as Open.
Mon, May 18, 2:10 PM · Operations
ayounsi reopened T224557: Migrate ldap/corp replicas to Stretch/Buster as "Open".

Not sure if I'm re-opening the proper task, but it looks relevant.

Mon, May 18, 2:10 PM · Operations
ayounsi added a comment to T165348: Check long-running screen/tmux sessions.

Today I got pinged by @ayounsi for a WARNING running for a few hours

For the record:

WARNING - (for 2d 15h 51m 27s) - Status Information: WARN: Long running SCREEN process. (user: root PID: 13601, 1089352s > 864000s).

1089352s = 12 days.
I'd not ping someone about a tmux running for a few hours.
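
The arithmetic behind the two durations in that alert can be checked with a short sketch. The `describe` helper is hypothetical; the two values are copied from the WARNING line above.

```python
from datetime import timedelta

THRESHOLD_S = 864000   # alert threshold from the check output
ELAPSED_S = 1089352    # age of the SCREEN process from the check output

def describe(seconds):
    """Format a duration in seconds as days, hours, minutes and seconds."""
    td = timedelta(seconds=seconds)
    hours, rem = divmod(td.seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{td.days}d {hours}h {minutes}m {secs}s"

print("threshold:", describe(THRESHOLD_S))
print("elapsed:  ", describe(ELAPSED_S))
```

So the check only warns once a session is older than 10 days, and this one was a bit over 12 days old.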

Mon, May 18, 8:30 AM · Patch-For-Review, observability, Operations
ayounsi added a comment to T243927: Export Netbox Stats for DCops to a visualization tool.

FYI, Prometheus is trying to query netbox2001.wikimedia.org:8443 but nothing is listening on that port, which is causing this alert:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Prometheus+jobs+reduced+availability

Prometheus jobs reduced availability - job=netbox_device_statistics site=codfw

Mon, May 18, 8:01 AM · SRE-tools, netbox, DC-Ops, User-crusnov
ayounsi closed T221259: eqord - ulsfo Telia link down - IC-313592 as Resolved.

Physically moving the optic to a different port solved the issue.
Opened T252988 to troubleshoot that specific issue.

Mon, May 18, 7:40 AM · ops-eqord, Operations, netops
ayounsi triaged T252988: Faulty port cr2-eqord:xe-0/1/1 as Low priority.
Mon, May 18, 7:39 AM · Operations, netops

Fri, May 15

ayounsi added a comment to T221259: eqord - ulsfo Telia link down - IC-313592.

From Telia:

Your service was affected by an outage along the transmission path, but the Loss of Signal we saw in Chicago happened after that outage had already started so it is unrelated.
Regarding the Loss of Sync alarm, it is something we see in our Chicago equipment:
This alarm is generated when a Loss Of Sync is detected from the client signal.
This alarm is most likely caused by:
A physically severed fiber between the Trib port and the client equipment
Physically severed fiber between local network element and the upstream network element
A faulty transmitter in the client equipment
When previously tested and again re-tested right now, placing a soft loop in Chicago facing the line, the traffic from San Francisco makes it to Chicago and then loops back, so we start transmitting back to you in San Francisco, so we know the span is good from San Francisco to Chicago. In Chicago, we had previously dispatched our equipment to hard loop test and replace our optic just in case, and I believe this was after you had already replaced your optic. Since that test was also passing, the next step in isolating the issue is if you can try a different port on your equipment, as well as verify all cabling.

Fri, May 15, 5:26 PM · ops-eqord, Operations, netops
ayounsi added a comment to T247881: Three ports on asw2-d-eqiad are not working as expected.

If they're dead:

  • Either we need them (e.g. we are short on ports), and in that case we need to replace the switch, which is a heavy operation.
  • Or we mark the ports as dead (with a mention of that task), disable them and call it a day.
Fri, May 15, 11:10 AM · ops-eqiad, netops, Operations
ayounsi updated subscribers of T221259: eqord - ulsfo Telia link down - IC-313592.

From Telia, after asking them about the light levels they're getting:

Looks like we are still at times seeing low light and errors in Chicago and transmitting those to San Francisco:
CHI: Rx -3.25 Tx -3.45
Rx -55.00 @ 11:00 to 11:15 UTC
Rx -55.00 @ 2:15 to 2:30 UTC
Rx PCS: ES for the last week from customer
San Jose: Rx -3.26
Tx -55.00
Tx -55.00 for the last week same as the errors in CHI
Tx errors for last week
Was it on the Chicago side that you changed optics, and can you try a different port there?
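
To put the dBm readings quoted above on a linear scale, a minimal sketch follows. The conversion formula is the standard definition of dBm; the `has_light` helper and its -40 dBm floor are assumptions for illustration only, not a Telia or Juniper threshold.

```python
def dbm_to_mw(dbm):
    """Convert optical power from dBm to milliwatts: P(mW) = 10 ** (dBm / 10)."""
    return 10 ** (dbm / 10)

def has_light(level_dbm, floor_dbm=-40.0):
    """Hypothetical check: treat readings at or below floor_dbm as loss of light."""
    return level_dbm > floor_dbm

# Readings copied from the Telia message above.
readings = [("CHI Rx", -3.25), ("CHI Tx", -3.45), ("CHI Rx during errors", -55.00)]
for label, level in readings:
    print(f"{label}: {level} dBm = {dbm_to_mw(level):.6f} mW, light={has_light(level)}")
```

Under that assumed floor, the -55.00 readings correspond to essentially no received light, which is consistent with the Loss of Signal alarms.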

Fri, May 15, 9:33 AM · ops-eqord, Operations, netops

Thu, May 14

ayounsi added a comment to T252797: asw2-d1-eqiad:VCP failure.

Unplugging that link caused fpc1 to lose connectivity to the rest of the VC, even though it's neither a VCP nor enabled.

asw2-d-eqiad fpc1 PFEMAN: Shutting down in 5 seconds, PFEMAN Resync aborted! No peer info on reconnect or master rebooted?
asw2-d-eqiad fpc1 CMLC: Going disconnected; Routing engine chassis socket closed abruptly

Thu, May 14, 5:54 PM · Operations, netops, ops-eqiad
ayounsi added a comment to T252797: asw2-d1-eqiad:VCP failure.

From T218059#5075466 it's probably due to the link disabled in T251663 acting up.

Thu, May 14, 5:20 PM · Operations, netops, ops-eqiad
ayounsi added a comment to T252797: asw2-d1-eqiad:VCP failure.

Disabled the last link, and the errors are still showing up, so I'm confused about where the issue is coming from.

Thu, May 14, 5:14 PM · Operations, netops, ops-eqiad
ayounsi added a comment to T252797: asw2-d1-eqiad:VCP failure.

pic-slot 1 port 3 member 1 was a leftover port configured as VC port, but without any cable connected to it.
Errors are still happening.

Thu, May 14, 5:00 PM · Operations, netops, ops-eqiad
ayounsi added a comment to T252797: asw2-d1-eqiad:VCP failure.

I disabled the mentioned link on the fpc2 side (so we don't risk fully losing access to fpc1) first.
Then on the fpc1 side to check if the alert was caused by this DAC.

Thu, May 14, 4:48 PM · Operations, netops, ops-eqiad
ayounsi triaged T252797: asw2-d1-eqiad:VCP failure as High priority.
Thu, May 14, 4:33 PM · Operations, netops, ops-eqiad
ayounsi triaged T252747: Generate ssh_known_hosts for network devices as Medium priority.
Thu, May 14, 8:02 AM · SRE-tools, Operations

Wed, May 13

ayounsi updated the task description for T252630: LibreNMS monitoring glitch caused paging.
Wed, May 13, 9:40 AM · Operations, observability, netops
ayounsi triaged T252631: Upgrade Junos on asw2-esams as Low priority.
Wed, May 13, 9:39 AM · netops, Operations
ayounsi triaged T252630: LibreNMS monitoring glitch caused paging as Medium priority.
Wed, May 13, 9:37 AM · Operations, observability, netops
ayounsi added a comment to T221259: eqord - ulsfo Telia link down - IC-313592.

Remote hands replaced the optics yesterday but the link is still down. Light levels are correct.

Wed, May 13, 7:50 AM · ops-eqord, Operations, netops

Tue, May 12

ayounsi added a comment to T247972: Cloud DNS: fix inconsistent ownership of reverse domains for openstack floating ip networks.

I started to look into that:

Tue, May 12, 4:33 PM · cloud-services-team (Kanban)
ayounsi added a comment to T251632: (Need By: TBD) rack/setup/install WMCS 10G switches.

Yep, see diagram (minus the typo).
cloudsw-c8-eqiad
cloudsw-d5-eqiad

Tue, May 12, 3:23 PM · cloud-services-team (Hardware), Operations, netops, ops-eqiad, DC-Ops
ayounsi closed T240817: Routinator RSYNC errors as Resolved.

Fix is now running in prod.
Grafana alerts have been updated accordingly.

Tue, May 12, 2:42 PM · Operations, netops
ayounsi committed rOHMP4d317c82ed15: Add blackhole and trusted_space (authored by ayounsi).
Add blackhole and trusted_space
Tue, May 12, 8:43 AM
ayounsi closed T229782: SRE firefighting improvements - 2019-20 Q1 Goal as Resolved.

I'd say yes. 1/ and 2/ are done.
VictorOps seems to be a good replacement for the [stretch] as it's possible to page people directly even if the infra is down.

Tue, May 12, 6:57 AM · SRE-OnFire, Goal, Operations

Mon, May 11

ayounsi committed rOHPU42199290d5aa: Generate blackhole prefix-list from private list (authored by ayounsi).
Generate blackhole prefix-list from private list
Mon, May 11, 9:52 PM
ayounsi updated subscribers of T251663: D1<->D8 VC link failure.

The only downside to removing the link fully is that D1 is 3 hops away from D8, which doesn't seem to have been an issue since May 2nd.
Upside is that it brings us closer to a proper cabling diagram.

Mon, May 11, 2:50 PM · Sustainability (Incident Prevention), Operations, netops
ayounsi updated the task description for T196487: upgrade row d to have 3 10G switches.
Mon, May 11, 2:05 PM · ops-eqiad, netops, Operations
ayounsi added a comment to T211850: install2002 94% disk usage on "/".

As 1001 and 2002 are gone this task might be good to close?

Mon, May 11, 12:42 PM · Operations

Fri, May 8

ayounsi closed T250405: Netbox test_blank_cable_label not working as Resolved.
Fri, May 8, 6:09 PM · User-crusnov, netbox
ayounsi claimed T250405: Netbox test_blank_cable_label not working.
Fri, May 8, 5:44 PM · User-crusnov, netbox
ayounsi placed T221259: eqord - ulsfo Telia link down - IC-313592 up for grabs.
Fri, May 8, 10:22 AM · ops-eqord, Operations, netops
ayounsi added a comment to T221259: eqord - ulsfo Telia link down - IC-313592.

After ticket 01157098 was resolved, the link didn't come back up.
Ticket 01157707 was opened.
Telia set up a loop on the Chicago side towards SF, which brought the SF interface up, but the Chicago-facing loop didn't bring the interface up.

Fri, May 8, 9:55 AM · ops-eqord, Operations, netops
ayounsi added a comment to T221259: eqord - ulsfo Telia link down - IC-313592.

ACKed for 6 more hours while Telia fixes it.

Fri, May 8, 8:16 AM · ops-eqord, Operations, netops
ayounsi closed T252010: Upgrade Routinator 3000 to 0.7.0 as Resolved.

Added routinator_rtr_current_connections to the Grafana dashboard.

Fri, May 8, 8:13 AM · Operations, netops

Thu, May 7

ayounsi committed rOSNE51c6591313d6: Remove Juniper report (authored by ayounsi).
Remove Juniper report
Thu, May 7, 9:38 AM

Wed, May 6

ayounsi added a comment to T252010: Upgrade Routinator 3000 to 0.7.0.

So far so good, will let it sit until tomorrow before tackling rpki1001.

Wed, May 6, 1:39 PM · Operations, netops
ayounsi triaged T252010: Upgrade Routinator 3000 to 0.7.0 as Low priority.
Wed, May 6, 1:21 PM · Operations, netops
ayounsi added a comment to T249022: Track and list the services that Cloud Services that connect to internal network endpoints.

Note that we only have netflow at our borders, and we sample 1:1000 so it might not be the right tool for now.
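
A rough sketch of what 1:1000 sampling implies for spotting individual connections. It assumes independent random 1-in-N packet sampling, which is a simplification and may not match the routers' actual sampling behaviour; both helper names are hypothetical.

```python
def estimate_total(sampled_packets, sampling_rate=1000):
    """Scale a sampled packet count back up to an estimated total."""
    return sampled_packets * sampling_rate

def probability_seen(flow_packets, sampling_rate=1000):
    """Chance that at least one packet of a flow is sampled (independent 1-in-N sampling assumed)."""
    return 1 - (1 - 1 / sampling_rate) ** flow_packets

for packets in (10, 100, 1000, 10000):
    print(f"{packets} packets: {probability_seen(packets):.1%} chance of appearing in netflow")
```

Under that assumption, short flows are unlikely to show up at all, so an exhaustive list of endpoints cannot be built from the sampled data alone.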

Wed, May 6, 11:29 AM · cloud-services-team (Kanban)
ayounsi created T252002: How to handle Icinga disabled notifications?.
Wed, May 6, 11:09 AM · observability

Tue, May 5

ayounsi closed Restricted Task, a subtask of T245755: Install superset on front end server for analytics, as Resolved.
Tue, May 5, 12:39 PM · Patch-For-Review, Analytics, fundraising-tech-ops, Fundraising-Backlog
ayounsi added a comment to T251632: (Need By: TBD) rack/setup/install WMCS 10G switches.

Cabling diagram, let me know if something is missing or unclear:

Tue, May 5, 11:45 AM · cloud-services-team (Hardware), Operations, netops, ops-eqiad, DC-Ops

Mon, May 4

ayounsi closed Restricted Task, a subtask of T245755: Install superset on front end server for analytics, as Resolved.
Mon, May 4, 2:21 PM · Patch-For-Review, Analytics, fundraising-tech-ops, Fundraising-Backlog
ayounsi closed T251767: Flowspec controller PoC as Resolved.
Mon, May 4, 1:46 PM · Operations, netops
ayounsi added a project to T251663: D1<->D8 VC link failure: Sustainability (Incident Prevention).
Mon, May 4, 10:37 AM · Sustainability (Incident Prevention), Operations, netops
ayounsi created T251729: 503 error in Netbox accounting report cause systemd alert.
Mon, May 4, 8:07 AM · netbox
ayounsi created T251728: snapshot of s3 in eqiad critical.
Mon, May 4, 7:49 AM · Operations
ayounsi created T251727: Maps - OSM synchronization lag - eqiad.
Mon, May 4, 7:41 AM · Operations, Maps
ayounsi added a comment to T251726: Certificate *.wikipedia.org valid until 2020-06-20.

Same for https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=phab.wmfusercontent.org&service=HTTPS-wmfusercontent
phab.wmfusercontent.org

Mon, May 4, 7:37 AM · Patch-For-Review, Traffic, serviceops, Operations
ayounsi created T251726: Certificate *.wikipedia.org valid until 2020-06-20.
Mon, May 4, 7:36 AM · Patch-For-Review, Traffic, serviceops, Operations
ayounsi created T251725: Netbox report PuppetDB PhysicalHosts critical.
Mon, May 4, 7:30 AM · Operations, ops-eqiad

Sat, May 2

ayounsi updated the task description for T251663: D1<->D8 VC link failure.
Sat, May 2, 7:42 AM · Sustainability (Incident Prevention), Operations, netops
ayounsi renamed T251663: D1<->D8 VC link failure from D1<->D8 link failure to D1<->D8 VC link failure.
Sat, May 2, 7:39 AM · Sustainability (Incident Prevention), Operations, netops
ayounsi merged T251601: Investigate D1 appservers<->memcache TKOs into T251663: D1<->D8 VC link failure.
Sat, May 2, 7:38 AM · Sustainability (Incident Prevention), Operations, netops
ayounsi merged task T251601: Investigate D1 appservers<->memcache TKOs into T251663: D1<->D8 VC link failure.
Sat, May 2, 7:38 AM · serviceops, netops, Operations
ayounsi triaged T251663: D1<->D8 VC link failure as High priority.
Sat, May 2, 7:38 AM · Sustainability (Incident Prevention), Operations, netops

Fri, May 1

ayounsi updated the task description for T251601: Investigate D1 appservers<->memcache TKOs.
Fri, May 1, 12:55 PM · serviceops, netops, Operations
ayounsi updated the task description for T251601: Investigate D1 appservers<->memcache TKOs.
Fri, May 1, 12:16 PM · serviceops, netops, Operations
ayounsi triaged T251601: Investigate D1 appservers<->memcache TKOs as High priority.
Fri, May 1, 12:14 PM · serviceops, netops, Operations

Thu, Apr 30

ayounsi changed the status of T245192: Investigate Juniper storm control from Open to Stalled.

Stalling the task until we either:

  • can start doing more intrusive testing to see if it works as expected
  • msw1-eqiad is replaced with T225121
Thu, Apr 30, 3:26 PM · Operations, Wikimedia-Incident, netops
ayounsi added a comment to T245192: Investigate Juniper storm control.

Thanks. Manual action is better here to prevent flapping.

Thu, Apr 30, 7:13 AM · Operations, Wikimedia-Incident, netops

Wed, Apr 29

ayounsi triaged T251373: Restbase/cassandra SSL certs expiration (2020-06-24) as Medium priority.
Wed, Apr 29, 8:35 AM · RESTBase-Cassandra
ayounsi added a member for SRE-OnFire-Incident-Docs: ayounsi.
Wed, Apr 29, 7:41 AM

Tue, Apr 28

ayounsi committed rOSHO03430b773935: Python 3.8 support (authored by ayounsi).
Python 3.8 support
Tue, Apr 28, 12:00 PM
ayounsi closed Restricted Task, a subtask of T247073: Configure management-instance on routers with Junos > 17.3, as Resolved.
Tue, Apr 28, 8:22 AM · Patch-For-Review, Operations, netops
ayounsi updated the task description for T205897: Netbox: fill network topology.
Tue, Apr 28, 8:20 AM · netbox, Operations
ayounsi added a comment to T200277: OSPF metrics.

Interesting idea! Couple of notes:

  • What do you mean by "virtual links" and Netbox not supporting them? Is that VLANs for our transports over the PtMP VPLS?

Yes, both PtMP VPLS (displayed as 3 links from site X to provider, and not site X to site Y) and GRE tunnels between sites.

  • What do you envision the difference to be between "primary" and "preferred"? (I know you said TBD, but curious :)

TBD, but this is to reflect our current logic exposed in the diagram.
Primary would be the default state. Preferred would be an override to drain alternate links.

Tue, Apr 28, 8:04 AM · Operations, netops
ayounsi triaged T251222: Upgrade LibreNMS to 1.63 as Low priority.
Tue, Apr 28, 6:57 AM · User-fgiunchedi, observability, Operations, netops

Mon, Apr 27

ayounsi triaged T251184: Add Grafana worldmap panel as Low priority.
Mon, Apr 27, 7:09 PM · observability
ayounsi updated the task description for T225140: Icinga alerts that should open tasks instead of alerting.
Mon, Apr 27, 4:52 PM · observability