
ayounsi (Arzhel Younsi)
Network Engineer

Projects (6)

User Details

User Since
Apr 3 2017, 6:23 PM (123 w, 5 d)
Availability
Available
IRC Nick
xionox
LDAP User
Ayounsi
MediaWiki User
AYounsi (WMF) [ Global Accounts ]

Recent Activity

Fri, Aug 16

ayounsi added a comment to T230600: Investigate the potential benefits of BGPalerter .

Indeed, should replace bgpmon.net (going EoL soon).

Fri, Aug 16, 3:48 PM · netops, Operations

Thu, Aug 15

ayounsi reopened T227808: Standardize cross confederation BGP policies, a subtask of T167841: Cleanup confed BGP peerings and policies, as Open.
Thu, Aug 15, 10:42 PM · Operations, netops
ayounsi reopened T227808: Standardize cross confederation BGP policies as "Open".

Reopening as the cleanup above is only part of the solution. It was made with the idea that it would be ok for all sites to route to any other site, while as explained by Brandon in T228190#5414366 it's better to influence that routing.

Thu, Aug 15, 10:42 PM · Operations, netops
ayounsi updated the task description for T226422: update RE-S-X6-64G-S in cr[12]-codfw.
Thu, Aug 15, 8:58 PM · netops, Operations, ops-codfw
ayounsi updated the task description for T167841: Cleanup confed BGP peerings and policies.
Thu, Aug 15, 6:51 PM · Operations, netops

Wed, Aug 14

ayounsi added a comment to T167841: Cleanup confed BGP peerings and policies.

Sounds good, final version, including both AS 65002 and AS 65001 as optional to keep it generic.
Tested the regex using show route aspath-regex "^(65002|65001)? 64600.*"
Will push IPv6 first, then 24h later IPv4 if everything is fine.

[edit routing-options rib inet6.0]
+    aggregate {
+        route 2620:0:860::/46 policy BGP_from_local_LVS;
+    }
[edit routing-options aggregate]
+    route 208.80.152.0/22 policy BGP_from_local_LVS;
[edit policy-options]
+   policy-statement BGP_from_local_LVS {
+       term BGP_local_LVS {
+           from {
+               protocol bgp;
+               as-path "^(65002|65001)? 64600.*";
+           }
+           then accept;
+       }
+       then reject;
+   }
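
To illustrate what the as-path regex above is meant to accept, here is a rough Python approximation. It is only an analogy: Junos matches whole AS numbers as tokens, while this sketch treats the AS path as a space-separated string. The function name and the sample paths are hypothetical.

```python
import re

# Python approximation of the Junos as-path regex "^(65002|65001)? 64600.*".
# Assumption: the AS path is given as a space-separated string of AS numbers.
LOCAL_LVS = re.compile(r"^(?:(?:65002|65001) )?64600(?: \d+)*$")


def from_local_lvs(as_path: str) -> bool:
    """True if the path starts with 64600, optionally preceded by one of the two sub-ASes."""
    return LOCAL_LVS.match(as_path) is not None


if __name__ == "__main__":
    # Hypothetical sample paths, not taken from a real routing table.
    for path in ["64600", "65002 64600", "65001 64600 64601",
                 "65003 64600", "65002 65001 64600"]:
        print(f"{path!r}: {from_local_lvs(path)}")
```

Paths with any other leading AS, or with both sub-ASes in front of 64600, are rejected in this sketch.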
Wed, Aug 14, 9:23 PM · Operations, netops
ayounsi updated the task description for T167841: Cleanup confed BGP peerings and policies.
Wed, Aug 14, 9:13 PM · Operations, netops
ayounsi added a subtask for T167841: Cleanup confed BGP peerings and policies: T167306: ospf link-protection.
Wed, Aug 14, 9:12 PM · Operations, netops
ayounsi added a parent task for T167306: ospf link-protection: T167841: Cleanup confed BGP peerings and policies.
Wed, Aug 14, 9:12 PM · Operations, netops
ayounsi added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

Circuit is down again, opened ticket 16915334.
Account rep replied to the thread and put their client support manager in the loop as well.

Wed, Aug 14, 7:10 PM · Operations, netops
ayounsi added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.

Looks good to me! Some data is missing but it seems to be an issue on the exporter side.

Wed, Aug 14, 4:07 PM · Analytics-Kanban, Analytics

Tue, Aug 13

ayounsi renamed T230448: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance from Aug 28th: turn off knams lasers & stop advertising prefixes in advance of Relined PA-988002 maintenance to Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance.
Tue, Aug 13, 8:46 PM · netops, Traffic, Operations
ayounsi added a comment to T230448: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance.

That actually seems to be one circuit terminating in two ports on each side:
cr2-knams:xe-0/0/3 to asw-esams:xe-0/0/32

Tue, Aug 13, 8:41 PM · netops, Traffic, Operations

Mon, Aug 12

ayounsi added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

Email sent to our account rep to know what they can do.

Mon, Aug 12, 5:39 PM · Operations, netops

Fri, Aug 9

ayounsi added a comment to T167841: Cleanup confed BGP peerings and policies.

For the eqord issue, this should work.
The `208.80.152.0/22` prefix gets created only if the router has (or learns) at least one contributing prefix (included in the /22) with a next-hop (directly connected routes are ignored).
On top of that, we use the new policy BGP_from_local_LVS to only accept (and thus consider as contributing) prefixes learned from BGP with an AS path starting with 64600, which only matches the LVS in the same confederation.
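
The contributing-route logic described above can be sketched in Python. This is a simplified model, not how Junos implements aggregates: the `Route` class, the function names and the sample routes are hypothetical, and the policy is reduced to "learned from BGP and AS path starts with 64600".

```python
import ipaddress
from dataclasses import dataclass

AGGREGATE = ipaddress.ip_network("208.80.152.0/22")


@dataclass
class Route:
    prefix: str     # e.g. "208.80.154.224/32"
    protocol: str   # e.g. "bgp", "direct", "ospf"
    as_path: str    # space-separated AS numbers, "" if none


def contributes(route: Route) -> bool:
    """A route contributes if it is a more-specific of the aggregate and passes the policy."""
    net = ipaddress.ip_network(route.prefix)
    more_specific = (
        net.version == AGGREGATE.version
        and net != AGGREGATE
        and net.subnet_of(AGGREGATE)
    )
    # Simplified stand-in for BGP_from_local_LVS: BGP-learned, path starts with 64600.
    passes_policy = route.protocol == "bgp" and route.as_path.split()[:1] == ["64600"]
    return more_specific and passes_policy


def aggregate_active(rib: list) -> bool:
    """The aggregate is only created when at least one route contributes."""
    return any(contributes(r) for r in rib)
```

With a RIB holding only a directly connected /24 inside the /22, `aggregate_active` returns False; adding a BGP-learned more-specific with AS path `64600` makes it True.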

Fri, Aug 9, 9:59 PM · Operations, netops

Thu, Aug 8

ayounsi claimed T229755: csw2-esams's VCP link flapped.
Thu, Aug 8, 9:28 PM · Operations, netops
ayounsi added a comment to T229755: csw2-esams's VCP link flapped.

Seems like this device is seeing its end coming with the esams refresh.

Thu, Aug 8, 9:28 PM · Operations, netops
ayounsi added a comment to T229998: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail.

I ACKed the Netbox/PuppetDB alert (missing VM from Netbox: poolcounter2001) linking to that task.

Thu, Aug 8, 5:57 PM · Operations
ayounsi closed T228824: Add VCP stats monitoring as Resolved.

We now have visibility on all VCPs;
https://librenms.wikimedia.org/ports/ifType=vcp/format=list_basic/
They also benefit from the same alerting as regular ports for saturation and errors.

Thu, Aug 8, 4:12 PM · observability, netops, Operations
ayounsi added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.

Nullifying them is fine. Depending on how costly they are, we could consider getting rid of other fields as well after 90 days (and aggregating the data through the remaining fields). I'm thinking of tcp_flags, ip_proto, peer_as and port. Obviously the more we keep, the better for us, unless it impacts performance :)
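
As an illustration of "aggregating the data through the remaining fields", here is a minimal Python sketch. It is not a Druid ingestion spec: the `rollup` function, the `as_dst` dimension and the `bytes`/`packets` metric names are assumptions made for the example.

```python
from collections import defaultdict

METRICS = ("bytes", "packets")  # assumed metric names


def rollup(rows, drop):
    """Drop the given dimensions and sum the metrics over the remaining ones."""
    totals = defaultdict(lambda: {m: 0 for m in METRICS})
    for row in rows:
        # The grouping key is every remaining dimension, in a stable order.
        key = tuple(sorted(
            (k, v) for k, v in row.items()
            if k not in drop and k not in METRICS
        ))
        for m in METRICS:
            totals[key][m] += row[m]
    return [{**dict(key), **metrics} for key, metrics in totals.items()]


if __name__ == "__main__":
    # Hypothetical flow records.
    flows = [
        {"as_dst": 1, "ip_proto": "tcp", "tcp_flags": 2, "bytes": 100, "packets": 1},
        {"as_dst": 1, "ip_proto": "udp", "tcp_flags": 0, "bytes": 50, "packets": 2},
        {"as_dst": 2, "ip_proto": "tcp", "tcp_flags": 16, "bytes": 10, "packets": 1},
    ]
    print(rollup(flows, drop={"ip_proto", "tcp_flags"}))
```

Dropping `ip_proto` and `tcp_flags` collapses the first two records into one row, which is where the storage saving would come from.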

Thu, Aug 8, 2:22 PM · Analytics-Kanban, Analytics

Wed, Aug 7

ayounsi added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.
  • event_type (is this needed @ayounsi ?)

Not needed.

Wed, Aug 7, 11:45 PM · Analytics-Kanban, Analytics
ayounsi added a comment to T228824: Add VCP stats monitoring.

This is working!
Why is that behind a configuration option and not enabled by default? I have no idea.
Will let those two sit overnight and roll it to the whole fleet if all good.

Wed, Aug 7, 11:28 PM · observability, netops, Operations
ayounsi closed T221156: cr4-ulsfo rebooted unexpectedly as Resolved.

According to engineering, there is not much information that can be provided from the crash, as the issue thread does not have any information and is blank.
This was not reproducible in the lab, so the reason for the issue was untraceable.
Please let me know if we saw any new crashes or issues on the box, so that we can investigate with more information.

Wed, Aug 7, 11:23 PM · Operations, netops

Fri, Aug 2

ayounsi added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

Talked to Faidon: using the backup link for a long time is costing us money (see overusage on https://librenms.wikimedia.org/bill/bill_id=17/). I made the Level3 link primary again.

Fri, Aug 2, 11:20 PM · Operations, netops
ayounsi closed T212011: migrate netinsights from rhenium to sulfur, a subtask of T201364: rack/setup/install sulfur.wikimedia.org, as Resolved.
Fri, Aug 2, 4:51 PM · ops-eqiad, Operations
ayounsi closed T212011: migrate netinsights from rhenium to sulfur as Resolved.

We created a VM (netflow1001) to replace rhenium, everything has been migrated.

Fri, Aug 2, 4:51 PM · netops, Operations
ayounsi closed T224477: rhenium [spare] server still receiving flow data, a subtask of T212011: migrate netinsights from rhenium to sulfur, as Resolved.
Fri, Aug 2, 4:50 PM · netops, Operations
ayounsi closed T224477: rhenium [spare] server still receiving flow data as Resolved.

This also caused the BGP sessions between rhenium (netflow) and the routers to alert.

Fri, Aug 2, 4:50 PM · Traffic, Operations
ayounsi reassigned T225121: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 from ayounsi to Papaul.

codfw is done. @Papaul let me know if you need help to prepare the eqiad one.

Fri, Aug 2, 3:36 PM · netops, ops-eqiad, Operations

Thu, Aug 1

ayounsi added a comment to T228824: Add VCP stats monitoring.

Service Request ID 2019-0801-0611 has been created.

Thu, Aug 1, 8:03 PM · observability, netops, Operations
ayounsi updated the task description for T224250: Setup new msw1-codfw.
Thu, Aug 1, 7:57 PM · ops-codfw, netops, Operations
ayounsi closed T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw, a subtask of T224250: Setup new msw1-codfw, as Resolved.
Thu, Aug 1, 7:57 PM · ops-codfw, netops, Operations
ayounsi closed T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw as Resolved.

This is done.

Thu, Aug 1, 7:57 PM · Operations, netops, ops-codfw
ayounsi added a comment to T102099: Fix IPv6 autoconf issues once and for all, across the fleet..

This should probably wait on T219908. Whatever solution we find to configure IPv4 based on Netbox data, IPv6 should be the same.

Thu, Aug 1, 7:46 PM · Patch-For-Review, Traffic, netops, Operations, IPv6
ayounsi triaged T229612: asw2-c-eqiad:xe-2/0/45 inbound interface errors as Normal priority.
Thu, Aug 1, 7:37 PM · netops, ops-eqiad, Operations
fgiunchedi awarded T229542: Export LibreNMS data to Prometheus a Like token.
Thu, Aug 1, 3:46 PM · observability
ayounsi triaged T229542: Export LibreNMS data to Prometheus as Low priority.
Thu, Aug 1, 4:36 AM · observability
ayounsi added a comment to T228824: Add VCP stats monitoring.

Good news, this is already implemented with: https://github.com/librenms/librenms/pull/9879

Thu, Aug 1, 4:26 AM · observability, netops, Operations

Wed, Jul 31

ayounsi added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

This circuit has been impacted by multiple planned maintenances and higher-level network events. They have all been different troubles that have been restored so there are no chronic issues impacting this service. In the future, please report all troubles so we can fully investigate while the logs and performance monitoring data is still available.

Wed, Jul 31, 10:35 PM · Operations, netops
ayounsi added a comment to T225314: Load Netflow to Druid.

Thanks! I enabled it and added more dimensions, please let us know if there is any issue.

Wed, Jul 31, 10:13 PM · Analytics-Kanban, Analytics
ayounsi added a comment to T226422: update RE-S-X6-64G-S in cr[12]-codfw.

Also noticed the following while looking at the doc again today:

  1. Use the request chassis routing-engine master switch command to make the Routing Engine RE-S-X6-64G (RE1) the master Routing Engine. All FPCs reboot after this step.
Wed, Jul 31, 8:38 PM · netops, Operations, ops-codfw
ayounsi added a comment to T226422: update RE-S-X6-64G-S in cr[12]-codfw.

First JTAC suggestion is to re-seat the SCB. We didn't do that today as the doc wasn't clear if it could be done with the router online.
JTAC is looking into the logs.

Wed, Jul 31, 8:34 PM · netops, Operations, ops-codfw
ayounsi added a comment to T226422: update RE-S-X6-64G-S in cr[12]-codfw.

The new backup routing engine is not coming online.
Rolling back to the old one is not working either.
Opened JTAC Service Request ID: 2019-0731-0446 .

Wed, Jul 31, 4:22 PM · netops, Operations, ops-codfw
ayounsi added a comment to T228275: Use centrallog1001 for network devices syslog.

I reconfigured those 3 to use syslog.anycast.wmnet

Wed, Jul 31, 1:52 PM · netops, User-fgiunchedi, Operations

Tue, Jul 30

ayounsi merged T229328: ps1 eqiad Icinga UNKNOWNs into T229101: Phase monitoring for new PDUs.
Tue, Jul 30, 7:38 PM · observability, DC-Ops, Operations
ayounsi merged task T229328: ps1 eqiad Icinga UNKNOWNs into T229101: Phase monitoring for new PDUs.
Tue, Jul 30, 7:38 PM · Operations, ops-eqiad, DC-Ops
ayounsi added a comment to T229328: ps1 eqiad Icinga UNKNOWNs.
icinga1001:~$ /usr/lib/nagios/plugins/check_snmp -H 10.65.0.34 -o .1.3.6.1.4.1.1718.3.2.2.1.7.1.1 -C <secret>
External command error: Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
Failed object: iso.3.6.1.4.1.1718.3.2.2.1.7.1.1
Tue, Jul 30, 7:37 PM · Operations, ops-eqiad, DC-Ops
ayounsi closed T225296: High Prometheus TCP retransmits as Resolved.

Looking at it again, it was a missing port in the router's firewall terms.

Tue, Jul 30, 4:51 PM · User-Elukey, User-fgiunchedi, Cloud-Services, observability, Analytics
ayounsi reopened T225296: High Prometheus TCP retransmits as "Open".

Thanks for tackling the analytics part, the Cloud one is still an issue:

Unrelated, the (some?) cloudvirt hosts have the prometheus rsyslog exporter listening on port 9105.
tcp6 0 0 :::9105 :::* LISTEN 33652/prometheus-rs
but it can't be queried from prometheus1004
eg.
prometheus1004:~$ curl -v cloudvirt1015.eqiad.wmnet:9105/metrics hangs
While the other exporter listening on 9100 replies fine.
As Prometheus is configured to query that endpoint, it tries, retries, and fails.

Tue, Jul 30, 4:37 PM · User-Elukey, User-fgiunchedi, Cloud-Services, observability, Analytics
ayounsi closed T227808: Standardize cross confederation BGP policies, a subtask of T167841: Cleanup confed BGP peerings and policies, as Resolved.
Tue, Jul 30, 12:19 AM · Operations, netops
ayounsi closed T227808: Standardize cross confederation BGP policies as Resolved.

This is done and pushed to all the sites.

Tue, Jul 30, 12:19 AM · Operations, netops

Mon, Jul 29

ayounsi claimed T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.
Mon, Jul 29, 6:22 PM · Operations, netops
ayounsi added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

We can see that link flapping in https://librenms.wikimedia.org/device/device=2/tab=port/port=6835/view=events/ as well. I think only one of those was a planned maintenance.
I called CenturyLink; they opened ticket 16814863 to investigate it, and I allowed them to do intrusive testing.
I drained that Level3 link as well by tuning OSPF metrics so their testing doesn't impact users.

Mon, Jul 29, 6:22 PM · Operations, netops

Fri, Jul 26

ayounsi closed T224223: decommission lvs100[123456].wikimedia.org as Resolved.

lvs100[1-6] removed from switches.

Fri, Jul 26, 6:45 PM · Traffic, Operations, DC-Ops

Thu, Jul 25

ayounsi added a comment to T227808: Standardize cross confederation BGP policies.

Confirmed working as expected, e.g. esams still shows the customer prefixes, plus now the BGP advertised prefixes (LVS/Anycast).
Will let it sit before rolling out to all sites.

Thu, Jul 25, 10:08 PM · Operations, netops
ayounsi added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

Note that there are 38 servers using SFP-Ts, which means using 1G on a 10G switch.

asw2-b-eqiad> show chassis hardware | match SFP-T | count 
Count: 38 lines

Ideally those should be the first ones to move out.

Thu, Jul 25, 7:52 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
ayounsi closed T228617: AS63541's session down reported by cr1-eqsin as Resolved.

Peer removed, cf. emails to peering@

Thu, Jul 25, 2:06 AM · netops, Operations
ayounsi closed T225108: Prometheus logs showing errors for routinator , a subtask of T220669: RPKI Validation, as Resolved.
Thu, Jul 25, 2:05 AM · Operations, netops
ayounsi closed T225108: Prometheus logs showing errors for routinator as Resolved.

Fixed with the latest upgrade of Routinator

Thu, Jul 25, 2:05 AM · observability, netops, Operations

Wed, Jul 24

ayounsi added a comment to T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw.

Scheduled for the 30th at 15:00UTC (1h total). Let me know if it needs to be rescheduled.

Wed, Jul 24, 8:35 PM · Operations, netops, ops-codfw
ayounsi added a comment to T226422: update RE-S-X6-64G-S in cr[12]-codfw.

Scheduled for the 31st at 15:00UTC (1h total).

Wed, Jul 24, 8:34 PM · netops, Operations, ops-codfw
ayounsi updated the task description for T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw.
Wed, Jul 24, 5:12 PM · Operations, netops, ops-codfw
ayounsi updated the task description for T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw.
Wed, Jul 24, 5:07 PM · Operations, netops, ops-codfw
ayounsi closed T228823: Faulty A6/A7 VC link as Resolved.

All done, no more errors or packet loss.

Wed, Jul 24, 3:07 PM · ops-eqiad, Operations
ayounsi updated the task description for T228823: Faulty A6/A7 VC link.
Wed, Jul 24, 3:05 PM · ops-eqiad, Operations
ayounsi added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

Level3/CenturyLink opened a ticket for that circuit and completed an emergency maintenance.
I also see some planned maintenance in the last few days.
And have at least one upcoming on 2019-07-31 04:00 GMT.
If it's service impacting we can de-pref the link by increasing its OSPF metric.
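
To show why increasing an OSPF metric de-prefs a link, here is a small shortest-path sketch in Python. Only the two router names come from the task; the intermediate node `via-backup` and all metric values are hypothetical.

```python
import heapq


def shortest_path(graph, src, dst):
    """Dijkstra over a dict of {node: {neighbor: metric}}; returns (cost, path) or None."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neigh, metric in graph.get(node, {}).items():
            if neigh not in seen:
                heapq.heappush(queue, (cost + metric, neigh, path + [neigh]))
    return None


def build_graph(level3_metric):
    """Symmetric toy topology: a direct Level3 link plus a two-hop backup path."""
    links = [
        ("cr2-eqiad", "cr2-esams", level3_metric),  # the Level3 circuit
        ("cr2-eqiad", "via-backup", 50),            # hypothetical backup path
        ("via-backup", "cr2-esams", 50),
    ]
    graph = {}
    for a, b, metric in links:
        graph.setdefault(a, {})[b] = metric
        graph.setdefault(b, {})[a] = metric
    return graph


if __name__ == "__main__":
    print(shortest_path(build_graph(80), "cr2-eqiad", "cr2-esams"))
    print(shortest_path(build_graph(500), "cr2-eqiad", "cr2-esams"))
```

With the Level3 metric at 80 the direct link wins; raising it to 500 makes the two-hop path (total cost 100) the preferred one, while the Level3 link stays available as a fallback.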

Wed, Jul 24, 2:36 PM · Operations, netops
ayounsi added a comment to T226782: a1-eqiad pdu refresh (Thursday 9/12 @11am UTC).

Seems like only 1 interface is master on cr1; the following is needed to fail it over:

[edit interfaces ae2 unit 1202 family inet6 address 2620:0:861:202:fe00::1/64 vrrp-inet6-group 2]
+        priority 70;

cr1 going down would be noticeable, but OSPF is quick to fail over (plus cr2 is the preferred path to codfw and esams). For BGP, we can pre-emptively disable the external peers, but this will have some user-facing impact (even though less than the device going down).
This would be a good use of BGP graceful shutdown (T211728).
As a power loss is quite unlikely if I understand correctly, I'd suggest to not do any routing changes. If anything goes bad we can fully depool cr2 when we tackle A8.
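
For reference, the effect of the `priority 70` change above can be modelled with a tiny VRRP election sketch. Assumptions: preemption is enabled, and the starting priorities (110 on cr1, 100 on cr2) and the link-local addresses are hypothetical values chosen for the example.

```python
import ipaddress


def elect_master(routers):
    """routers maps name -> (priority, primary address).

    The highest priority wins; a tie goes to the higher primary address.
    """
    return max(
        routers,
        key=lambda name: (
            routers[name][0],
            int(ipaddress.ip_address(routers[name][1])),
        ),
    )


if __name__ == "__main__":
    before = {"cr1": (110, "fe80::1"), "cr2": (100, "fe80::2")}
    after = {"cr1": (70, "fe80::1"), "cr2": (100, "fe80::2")}
    print(elect_master(before))
    print(elect_master(after))
```

Lowering cr1 below cr2's priority hands mastership for that group to cr2 without touching cr2's configuration.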

Wed, Jul 24, 2:37 AM · DC-Ops, Operations, ops-eqiad
ayounsi triaged T228824: Add VCP stats monitoring as Normal priority.
Wed, Jul 24, 2:12 AM · observability, netops, Operations
ayounsi triaged T228823: Faulty A6/A7 VC link as High priority.
Wed, Jul 24, 1:57 AM · ops-eqiad, Operations
ayounsi lowered the priority of T228617: AS63541's session down reported by cr1-eqsin from Normal to Lowest.

Email sent to Chinacache. If no replies in ~1w then we will remove the session.

Wed, Jul 24, 12:50 AM · netops, Operations

Jul 18 2019

ayounsi added a comment to T228275: Use centrallog1001 for network devices syslog.

Only solution I found so far on Juniper is to deactivate/activate that syslog target (tested with cr2-esams).

Jul 18 2019, 10:25 PM · netops, User-fgiunchedi, Operations
ayounsi reopened T223458: mgmt outages for cloud* systems seem to page everyone as "Open".

Reopening it as there was a sub-issue mentioned in T223458#5238223. Might be worth forking it into its own task though.

Jul 18 2019, 9:16 PM · Patch-For-Review, cloud-services-team (Kanban)
ayounsi added a comment to T228086: Swift TCP retransmits increase.

Not sure if relevant or not, but cluster wmcs also shows elevated retransmits around the same period:

Jul 18 2019, 3:01 PM · User-fgiunchedi, Operations, media-storage

Jul 17 2019

ayounsi added a comment to T228277: Use centrallog1001 instead of lithium for PDU syslog.

PDUs should be set to use the CNAMEs syslog.codfw.wmnet and syslog.eqiad.wmnet. Is it possible to change the CNAMEs instead?

Jul 17 2019, 4:00 PM · DC-Ops, User-fgiunchedi
ayounsi added a comment to T228275: Use centrallog1001 for network devices syslog.

Network devices are set to use the CNAMEs syslog.codfw.wmnet and syslog.eqiad.wmnet. Is it possible to change the CNAMEs instead?

Jul 17 2019, 3:42 PM · netops, User-fgiunchedi, Operations

Jul 16 2019

ayounsi closed T228205: deploy pfw policy eqiad-1563305452 & codfw-1563305452, a subtask of T228164: refresh IP address list for maxmind API, as Resolved.
Jul 16 2019, 8:00 PM · fundraising-tech-ops
ayounsi closed T228205: deploy pfw policy eqiad-1563305452 & codfw-1563305452 as Resolved.

Done.

Jul 16 2019, 8:00 PM · Operations, netops
ayounsi added a comment to T186550: Anycast recdns.

It will eventually. Only a few servers are using the new IPs for now, I opened T228190 to roll it out.

Jul 16 2019, 4:51 PM · Patch-For-Review, netops, Operations, Traffic
ayounsi triaged T228190: Roll out Anycast RecDNS to more servers as Normal priority.
Jul 16 2019, 4:50 PM · Patch-For-Review, Operations, Traffic
ayounsi closed T186550: Anycast recdns, a subtask of T98006: Anycast (Auth)DNS, as Resolved.
Jul 16 2019, 12:53 AM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
ayounsi closed T186550: Anycast recdns as Resolved.

Everything in the scope of that task is completed.

Jul 16 2019, 12:53 AM · Patch-For-Review, netops, Operations, Traffic

Jul 15 2019

ayounsi added a comment to T167841: Cleanup confed BGP peerings and policies.

Finally, the more user-visible issue that we have right now is that we're underutilizing eqord: we currently do not announce our supernets from eqord. The reason for this is that I hadn't found an easy way to guarantee that it wouldn't be announced if both eqiad<->eqord and eqord<->codfw was down, but eqord<->ulsfo and ulsfo<->codfw was up. The only solution that I could think of was splitting eqord in its own subAS and then doing a cross-subAS import policy with a ^65001 regexp.

Jul 15 2019, 10:37 PM · Operations, netops
ayounsi updated the task description for T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw.
Jul 15 2019, 10:19 PM · Operations, netops, ops-codfw
ayounsi added a parent task for T224250: Setup new msw1-codfw: T225121: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300.
Jul 15 2019, 10:15 PM · ops-codfw, netops, Operations
ayounsi added a subtask for T225121: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300: T224250: Setup new msw1-codfw.
Jul 15 2019, 10:15 PM · netops, ops-eqiad, Operations
ayounsi added a parent task for T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw: T224250: Setup new msw1-codfw.
Jul 15 2019, 10:14 PM · Operations, netops, ops-codfw
ayounsi added a subtask for T224250: Setup new msw1-codfw: T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw.
Jul 15 2019, 10:14 PM · ops-codfw, netops, Operations
ayounsi triaged T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw as Normal priority.
Jul 15 2019, 10:14 PM · Operations, netops, ops-codfw
ayounsi closed T227967: mr1-eqsin.oob IPv6 connectivity flapping as Resolved.

Seems like fixing T228015 fixed that issue as well.

Jul 15 2019, 9:12 PM · Operations, netops
ayounsi closed T228015: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin as Resolved.

They were very quick to reply and fix the issue.

Jul 15 2019, 9:09 PM · Operations, netops
ayounsi added a comment to T227967: mr1-eqsin.oob IPv6 connectivity flapping.

So far I don't think there is a link between the ripe alerts and the oob alerts.

Jul 15 2019, 9:01 PM · Operations, netops
ayounsi added a comment to T228015: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin.

Seems like HE in eqsin is having a bad time.
I depreffed all AS paths that go through HE and the packet loss stopped.
Emailed HE's NOC.

Jul 15 2019, 8:59 PM · Operations, netops
ayounsi added a comment to T228086: Swift TCP retransmits increase.

The same thing started to happen around the same time for labstore1007: https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&panelId=2&fullscreen&from=now-30d&to=now (temporary dashboard)

Jul 15 2019, 8:07 PM · User-fgiunchedi, Operations, media-storage
ayounsi updated subscribers of T228086: Swift TCP retransmits increase.

If we narrow it down to ms-fe* hosts they regularly spike between 5% and 15% which is a bit more worrying.
https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&panelId=3&fullscreen&from=now-30d&to=now (temporary dashboard)

Jul 15 2019, 8:00 PM · User-fgiunchedi, Operations, media-storage
ayounsi triaged T228086: Swift TCP retransmits increase as High priority.
Jul 15 2019, 6:20 PM · User-fgiunchedi, Operations, media-storage

Jul 14 2019

ayounsi claimed T227967: mr1-eqsin.oob IPv6 connectivity flapping.

Thanks, email sent to Equinix NOC.
So far I don't think there is a link between the ripe alerts and the oob alerts.

Jul 14 2019, 11:36 PM · Operations, netops

Jul 12 2019

ayounsi added a comment to T221156: cr4-ulsfo rebooted unexpectedly.

Still no news, asked to escalate the case.

Jul 12 2019, 6:40 PM · Operations, netops
ayounsi added a comment to T218751: Audit down ports.

Note that according to https://librenms.wikimedia.org/ports/state=down/hostname=asw/format=list_basic/ there are still (and new) down ports on A/B/C.

Jul 12 2019, 2:13 AM · DC-Ops, ops-eqiad, Operations
ayounsi updated subscribers of T224250: Setup new msw1-codfw.

T84333 is when the msw1<->cr connection was made.
The other option would be to add a 10G extension module to msw1 (EX-UM-4X4SFP), as the MPC3D doesn't support 1G uplinks. @Papaul, do you have any spares?

Jul 12 2019, 2:06 AM · ops-codfw, netops, Operations

Jul 11 2019

ayounsi added a parent task for T227808: Standardize cross confederation BGP policies: T167841: Cleanup confed BGP peerings and policies.
Jul 11 2019, 6:29 PM · Operations, netops