ayounsi (Arzhel Younsi)
Network Engineer

User Details

User Since
Apr 3 2017, 6:23 PM (140 w, 2 h)
Availability
Available
IRC Nick
xionox
LDAP User
Ayounsi
MediaWiki User
AYounsi (WMF)

Recent Activity

Thu, Dec 5

ayounsi committed rOSNEe2670d404486: Netbox Juniper installed base report (authored by ayounsi).
Netbox Juniper installed base report
Thu, Dec 5, 11:20 PM
ayounsi committed rOSNEeace9f142ddb: Ignore SJ Manufacturing ThruPower devices for LibreNMS report (authored by ayounsi).
Ignore SJ Manufacturing ThruPower devices for LibreNMS report
Thu, Dec 5, 11:20 PM
ayounsi committed rOSNEf32bc898f436: Format README, remove mention to oldhardware.py (authored by ayounsi).
Format README, remove mention to oldhardware.py
Thu, Dec 5, 11:20 PM
ayounsi committed rOSNE5392495032ad: Exclude esams from management report (authored by ayounsi).
Exclude esams from management report
Thu, Dec 5, 11:20 PM

Tue, Nov 26

ayounsi triaged T239256: Add monitoring for BGP peers exceeding prefix-limit as Medium priority.
Tue, Nov 26, 5:48 PM · Operations, netops
ayounsi added a comment to T224888: Network port utilization alerts should be paging.

That looks good! We might want to create a specific LibreNMS alert for the transit/peering links only, but can start with the existing ones.

Tue, Nov 26, 4:11 PM · observability, Traffic, netops, Operations
ayounsi closed T239098: Duplicate cable label in cr1-eqiad/cr2-eqiad as Resolved.

Updated.

Tue, Nov 26, 7:36 AM · DC-Ops, ops-eqiad, Operations
ayounsi closed T237030: Setup new MX204 in knams, a subtask of T235805: ESAMS Refresh/Rebuild (October 2019), as Resolved.
Tue, Nov 26, 6:44 AM · Patch-For-Review, DC-Ops, Operations, ops-esams
ayounsi closed T237030: Setup new MX204 in knams as Resolved.

All done.

Tue, Nov 26, 6:44 AM · netops, Operations, ops-esams

Mon, Nov 25

ayounsi added a comment to T238919: Cleanup Netbox stuff from netmon hosts.
  • Postgres is left running. It does not appear to be used by anything else on the box. @ayounsi is this the case? If so I can remove that as well.

Indeed.

Mon, Nov 25, 4:48 PM · netbox
ayounsi closed T237031: Bundle esams-knams links back as Resolved.

All done. Not re-enabling knams transits as we're setting up the new MX204 right now.

Mon, Nov 25, 4:39 PM · Operations, ops-esams
ayounsi updated the task description for T237030: Setup new MX204 in knams.
Mon, Nov 25, 2:35 PM · netops, Operations, ops-esams

Fri, Nov 22

ayounsi closed T238781: dropped packets to phab1003 22280/tcp, a subtask of T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled), as Resolved.
Fri, Nov 22, 7:41 PM · Traffic, Operations, serviceops, Phabricator
ayounsi closed T238781: dropped packets to phab1003 22280/tcp as Resolved.

Confirmed!

Fri, Nov 22, 7:41 PM · Operations, serviceops

Thu, Nov 21

ayounsi added a comment to T238795: The "logstash-*" index pattern does not contain any of the following field types: ip.

From https://logstash.wikimedia.org/app/kibana#/dashboard/69b9fbe0-3c1b-11e8-90f7-4958fd3a62b4
src-ip
dst-ip

Thu, Nov 21, 5:42 PM · Operations, observability

Wed, Nov 20

ayounsi triaged T238795: The "logstash-*" index pattern does not contain any of the following field types: ip as Lowest priority.
Wed, Nov 20, 9:19 PM · Operations, observability
ayounsi added a comment to T238794: dropped packets to kafkamon 9000/tcp.

In addition prometheus2003/4.codfw.wmnet are also trying to reach 9700/tcp on kafkamon2001 only.

Wed, Nov 20, 8:46 PM · Operations, observability
ayounsi triaged T238794: dropped packets to kafkamon 9000/tcp as Medium priority.
Wed, Nov 20, 8:42 PM · Operations, observability
ayounsi created T238791: dropped packets to conf1004/5/6 2379/tcp.
Wed, Nov 20, 8:34 PM · Operations, serviceops, observability
ayounsi triaged T238789: dropped packets to echostore.svc.eqiad 8082/tcp as Medium priority.
Wed, Nov 20, 8:27 PM · Operations, serviceops
ayounsi triaged T238781: dropped packets to phab1003 22280/tcp as Medium priority.
Wed, Nov 20, 7:06 PM · Operations, serviceops

Tue, Nov 19

ayounsi closed T238677: "unknown session id" from bird on centrallog hosts as Resolved.

Clearing the BFD session on the router and restarting bird solved the issue.
If it happens again, please reopen and I'll investigate further.

Tue, Nov 19, 6:08 PM · netops, Operations
ayounsi claimed T238677: "unknown session id" from bird on centrallog hosts.
Tue, Nov 19, 5:33 PM · netops, Operations

Mon, Nov 18

ayounsi added a comment to T227632: Document PDU models.

In Netbox, Smart CDU & Switched CDU are generic Types. They should be replaced by the exact model.

Mon, Nov 18, 7:27 PM · netbox, Operations, ops-codfw, ops-eqiad

Sat, Nov 16

ayounsi committed rOHPU024d244581d7: Add security alg/forwarding-options/screen to mr template (authored by ayounsi).
Add security alg/forwarding-options/screen to mr template
Sat, Nov 16, 12:04 AM

Fri, Nov 15

ayounsi added a comment to T238416: Logstash doesn't parse ulogd source and destination ports.

Not that I can think of for now. Thanks!

Fri, Nov 15, 9:41 PM · Operations, observability
ayounsi committed rOHPU553ccc712f32: msw: ensure no vlans are configured (authored by ayounsi).
msw: ensure no vlans are configured
Fri, Nov 15, 7:12 PM
ayounsi triaged T238416: Logstash doesn't parse ulogd source and destination ports as Lowest priority.
Fri, Nov 15, 4:57 PM · Operations, observability
ayounsi triaged T238414: Write ulogd logs to a dedicated logfile as Medium priority.
Fri, Nov 15, 4:27 PM · observability, Operations

Wed, Nov 13

ayounsi updated the task description for T237031: Bundle esams-knams links back.
Wed, Nov 13, 8:45 PM · Operations, ops-esams
ayounsi updated the task description for T237030: Setup new MX204 in knams.
Wed, Nov 13, 7:48 PM · netops, Operations, ops-esams
ayounsi added a comment to T237009: Add missing labels for equipment and cables.

See also: https://netbox.wikimedia.org/extras/reports/cables.Cables/#test_blank_cable_label

Wed, Nov 13, 7:22 PM · DC-Ops, Operations, ops-esams
ayounsi added a comment to T237006: Relabel cables with duplicate IDs.

See also https://netbox.wikimedia.org/extras/reports/cables.Cables/#test_duplicate_cable_label

Wed, Nov 13, 7:22 PM · Operations, ops-esams
ayounsi added a comment to T238006: Icinga alert for hosts with no Puppet roles.

I don't know if the host showed up in Icinga.
The host was cloudcephmon1001.wikimedia.org but it happened to other hosts in the past.
I *think* it should be a check for the puppetmaster host though, maybe querying puppetDB?

Wed, Nov 13, 3:58 PM · Operations, observability, Puppet

Tue, Nov 12

ayounsi closed T211728: Outbound BGP graceful shutdown as Resolved.

All good!

Tue, Nov 12, 9:12 PM · Patch-For-Review, Operations, netops
ayounsi committed rOHPU36f5649396ca: Add policy-statement BGP_graceful_shutdown_out (authored by ayounsi).
Add policy-statement BGP_graceful_shutdown_out
Tue, Nov 12, 9:03 PM
ayounsi closed T238036: scs-c1-eqiad CPU usage over 85% as Resolved.

CPU is back to normal.

Tue, Nov 12, 4:33 PM · Operations, netops
ayounsi added a comment to T238036: scs-c1-eqiad CPU usage over 85%.
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
1855 root      20   0  4576 2084  876 S  2.7  0.8  26623:27 portmanager

Did a kill -9 on portmanager just in case, but it didn't change anything (the process restarted with the same 2% CPU load).
Then killed snmpd, which lowered the CPU for a bit, but then it went back up.
Trying a reboot.

Tue, Nov 12, 4:21 PM · Operations, netops

Mon, Nov 11

ayounsi closed T238018: mw1239 - Memory correctable errors -EDAC- as Declined.

wfm.

Mon, Nov 11, 8:19 PM · Operations, serviceops
ayounsi triaged T238018: mw1239 - Memory correctable errors -EDAC- as High priority.
Mon, Nov 11, 7:05 PM · Operations, serviceops
ayounsi triaged T238006: Icinga alert for hosts with no Puppet roles as Medium priority.
Mon, Nov 11, 5:15 PM · Operations, observability, Puppet

Nov 7 2019

ayounsi created P9556 (An Untitled Masterwork).
Nov 7 2019, 10:10 PM
ayounsi triaged T237649: Boron disk space alert as High priority.
Nov 7 2019, 3:59 PM · User-jbond, Operations
ayounsi added a comment to T237308: Feedback on new alert setup.

This change is causing this Icinga alert:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=webpagereplay-enwiki-alerts+grafana+alert

UNKNOWN: failed to fetch info about dashboard with uid=000000748 due to exception: HTTP Error 404: Not Found

Nov 7 2019, 3:51 PM · WebPageReplay, Performance-Team
ayounsi added a comment to T237598: Create new Phab project tag "homer".

@Aklapper sure!

Homer is Wikimedia's network configuration manager

https://wikitech.wikimedia.org/wiki/Homer

Nov 7 2019, 3:19 PM · Project-Admins
ayounsi created T237598: Create new Phab project tag "homer".
Nov 7 2019, 1:21 AM · Project-Admins
ayounsi closed T236878: Improve resiliency of the eqsin transport link as Resolved.

Damping configured.

Nov 7 2019, 12:27 AM · Wikimedia-Incident, Operations, netops
ayounsi updated subscribers of T237587: Determine & implement near-term method for escalating network alerts.

Interface saturation

See also T224888

Nov 7 2019, 12:04 AM · Operations, netops, observability

Nov 6 2019

ayounsi triaged T237567: mw1247: IPMI Sensor Status UNKNOWN internal IPMI error as Medium priority.
Nov 6 2019, 7:02 PM · ops-eqiad, Operations
ayounsi added a project to T224888: Network port utilization alerts should be paging: observability.
Nov 6 2019, 4:35 PM · observability, Traffic, netops, Operations

Nov 2 2019

ayounsi committed rOHPU1c779abaf419: Rename site to metadata['site'] (authored by ayounsi).
Rename site to metadata['site']
Nov 2 2019, 12:39 AM

Nov 1 2019

ayounsi added a comment to T229710: read-only user netbox permissions regression.

No strong preferences, that looks like a good fix indeed.

Nov 1 2019, 7:33 PM · netbox
ayounsi committed rOHPU23fd1192a340: Add term vmhost to cr loopback4 filter (authored by ayounsi).
Add term vmhost to cr loopback4 filter
Nov 1 2019, 4:35 PM
ayounsi added a comment to T213843: Juniper network device audit - all sites.

After a lot of back and forth with Juniper, the current status is:
test_consistency fails 70
test_missing_device_from_installed_base 4
test_missing_inventory_from_installed_base 183

Nov 1 2019, 4:14 PM · DC-Ops, netops, Operations

Oct 31 2019

ayounsi reopened T237011: Update DNS/NTP servers on the esams PDUs/SCS as "Open".

All network devices updated.

Oct 31 2019, 11:43 PM · DC-Ops, ops-esams, Operations
ayounsi renamed T236785: Configure conditional advertising in eqdfw and knams from Configure conditional advertizing in eqdfw and knams to Configure conditional advertising in eqdfw and knams.
Oct 31 2019, 10:48 PM · Operations, netops
ayounsi triaged T237031: Bundle esams-knams links back as Medium priority.
Oct 31 2019, 3:22 PM · Operations, ops-esams
ayounsi triaged T237030: Setup new MX204 in knams as Medium priority.
Oct 31 2019, 3:18 PM · netops, Operations, ops-esams
ayounsi triaged T237027: Upgrade cr2-esams to JTAC recommended as Low priority.
Oct 31 2019, 3:12 PM · Operations, netops
ayounsi triaged T237014: Update spare QFX labels as Low priority.
Oct 31 2019, 2:04 PM · ops-esams, Operations
ayounsi added a comment to T237007: Add a Netbox check for duplicate cable IDs.

A report with cables with no IDs (label) as well would be useful.

Oct 31 2019, 1:22 PM · DC-Ops, SRE-tools, netbox
ayounsi updated the task description for T205897: Netbox: fill network topology.
Oct 31 2019, 12:50 PM · netbox, Operations

Oct 30 2019

ayounsi closed T236598: cr3-esams crash as Resolved.

Power cycled CB1 (hosting re1) following https://kb.juniper.net/InfoCenter/index?page=content&id=KB14278&cat=JUNOS&actp=LIST and RE1 is now back online in a healthy state.

Oct 30 2019, 11:17 PM · Operations, netops
ayounsi added a comment to T236598: cr3-esams crash.

re1 is unresponsive, even through console.
We have 2 options to try to power cycle it:

  • Have someone onsite unseat/reseat the card (non disruptive)
  • Power cycle the whole router (disruptive)
Oct 30 2019, 8:31 AM · Operations, netops
ayounsi triaged T236878: Improve resiliency of the eqsin transport link as Medium priority.
Oct 30 2019, 8:21 AM · Wikimedia-Incident, Operations, netops

Oct 29 2019

ayounsi renamed T236785: Configure conditional advertising in eqdfw and knams from Configure conditional advertizing to eqdfw and knams to Configure conditional advertizing in eqdfw and knams.
Oct 29 2019, 12:47 PM · Operations, netops
ayounsi updated the task description for T236785: Configure conditional advertising in eqdfw and knams.
Oct 29 2019, 12:44 PM · Operations, netops
ayounsi triaged T236785: Configure conditional advertising in eqdfw and knams as Medium priority.
Oct 29 2019, 12:44 PM · Operations, netops
ayounsi triaged T236767: cr3-esams:et-1/0/0 flap as Medium priority.
Oct 29 2019, 10:11 AM · Operations, ops-esams
ayounsi added a comment to T236598: cr3-esams crash.

All the interfaces are back up and cr3-esams is now reachable and in service.

Oct 29 2019, 8:24 AM · Operations, netops
ayounsi added a comment to T236598: cr3-esams crash.

We have found matching PR1179822: Chassisd might crash if lo0 filter is configured without allowing communication between RE and VM-host on RE. As a result, the internal interfaces are incorrectly examined by lo0 filter, none of the FPCs will be online and no interface will be created.
Workaround
Please allow the internal communication between RE and VM-Host in lo0 filter (if lo0 filter being used)
Read more at: https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1179822

Oct 29 2019, 8:14 AM · Operations, netops

Oct 26 2019

ayounsi added a comment to T236598: cr3-esams crash.

Opened Juniper case 2019-1026-0004.

Oct 26 2019, 4:26 PM · Operations, netops
ayounsi closed T184065: Setup new access switches, a subtask of T184064: Prepare racks OE14, OE15 and OE16 with new infrastructure, as Resolved.
Oct 26 2019, 10:03 AM · Operations, ops-esams
ayounsi closed T184065: Setup new access switches, a subtask of T215991: Repurpose csw2-oe14/15 and lab-ex4200 as msw, as Resolved.
Oct 26 2019, 10:03 AM · Operations, ops-esams
ayounsi closed T184065: Setup new access switches as Resolved.

This is done.

Oct 26 2019, 10:03 AM · Operations, ops-esams

Oct 23 2019

ayounsi closed T184067: Complete router migration from cr1-esams to cr3-esams as Resolved.

Done.

Oct 23 2019, 6:03 PM · netops, Operations, ops-esams
ayounsi closed T184067: Complete router migration from cr1-esams to cr3-esams, a subtask of T184061: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking), as Resolved.
Oct 23 2019, 6:03 PM · Operations, Epic, ops-esams
ayounsi closed T174616: set up cr3-esams, a subtask of T184067: Complete router migration from cr1-esams to cr3-esams, as Resolved.
Oct 23 2019, 6:03 PM · netops, Operations, ops-esams
ayounsi closed T174616: set up cr3-esams, a subtask of T196432: Configure interface damping on primary links, as Resolved.
Oct 23 2019, 6:03 PM · Wikimedia-Incident, Operations, Traffic, netops
ayounsi closed T174616: set up cr3-esams as Resolved.

Done.

Oct 23 2019, 6:03 PM · ops-esams, Operations, netops

Oct 22 2019

ayounsi reopened Unknown Object (Task), a subtask of T235805: ESAMS Refresh/Rebuild (October 2019), as Open.
Oct 22 2019, 5:32 AM · Patch-For-Review, DC-Ops, Operations, ops-esams

Oct 18 2019

ayounsi triaged T235886: IRR updates needed as Low priority.
Oct 18 2019, 3:13 PM · Operations, netops

Oct 15 2019

ayounsi closed T186550: Anycast recdns, a subtask of T98006: Anycast AuthDNS, as Resolved.
Oct 15 2019, 7:53 AM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
ayounsi closed T186550: Anycast recdns as Resolved.
Oct 15 2019, 7:53 AM · Patch-For-Review, netops, Operations, Traffic
ayounsi added a comment to T186550: Anycast recdns.

Couple of notes about the anycast-healthchecker:

  1. python3-docopt seems to be required by the healthchecker's nagios monitor, and it was missing on lithium. I installed it manually via apt on it and opened https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/526849/
Oct 15 2019, 7:53 AM · Patch-For-Review, netops, Operations, Traffic

Oct 11 2019

ayounsi added a comment to T226778: Install new PDUs in rows A/B (Top level tracking task).

Can I suggest a few modifications to the PDU swap checklist of each task? Mostly to clear out the alerting noise
Under: "schedule downtime for the entire list of switches and servers"
Add:
[] Downtime PDUs in Icinga for the time of the maintenance + time for the new one to get re-configured
I know this can be controversial as people use Icinga in different ways, but I believe this is best practice.

Oct 11 2019, 8:13 AM · DC-Ops, Operations, ops-eqiad

Oct 10 2019

ayounsi added a project to T235162: Restrict GIDs for system users to 499 as the upper boundary: Operations.
Oct 10 2019, 10:14 AM · User-jbond, Operations
ayounsi added a comment to T232007: Restbase: significant increase of outbound dropped packets.

Back to normal on 10-07
https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=21&fullscreen&edit&tab=alert&orgId=1&from=1570407765208&to=1570486140039

Oct 10 2019, 8:14 AM · service-runner, RESTBase, User-mobrovac, Core Platform Team Workboards (Clinic Duty Team)

Oct 9 2019

ayounsi added a comment to T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC).

It was a PDU misconfiguration and a monitoring issue. It was solved in https://phabricator.wikimedia.org/T229328

Oct 9 2019, 4:33 PM · DC-Ops, Operations, ops-eqiad

Oct 8 2019

ayounsi closed T232617: BGP sessions down on cr2-esams as Resolved.

Seems like they had 4 sessions in total.

Oct 8 2019, 3:31 PM · netops, Operations

Oct 7 2019

ayounsi committed rOHPUf10738beb2b8: Add kerberos hosts to analytics-in4 + add kerberos to analytics-in6 (authored by ayounsi).
Add kerberos hosts to analytics-in4 + add kerberos to analytics-in6
Oct 7 2019, 8:57 PM
ayounsi committed rOHPU22d3734239d6: Add BGP prefix damping to IX policies (authored by ayounsi).
Add BGP prefix damping to IX policies
Oct 7 2019, 8:57 PM
ayounsi committed rOSHO4ffe0e1f0c99: Add commit action to the Homer class (authored by Volans).
Add commit action to the Homer class
Oct 7 2019, 5:41 PM
ayounsi closed T222424: configure BGP route damping on IX sessions as Resolved.

All done!

Oct 7 2019, 5:29 PM · Operations, netops
ayounsi added a project to T234831: Massmessage only arriving on Flow-user talk pages: Operations.
Oct 7 2019, 3:56 PM · MassMessage

Oct 3 2019

ayounsi committed rOHMP0a359bb6fbb2: README, common and asw2-a/b/c-eqiad mock private data (authored by ayounsi).
README, common and asw2-a/b/c-eqiad mock private data
Oct 3 2019, 6:13 PM

Oct 2 2019

ayounsi closed T234416: asw2-a-eqiad <-> cr2-eqiad fiber issue as Resolved.
ayounsi@asw2-a-eqiad> show interfaces diagnostics optics xe-7/0/46 | match "rx|receive" 
    Receiver signal average optical power     :  0.0741 mW / -11.30 dBm
    Laser rx power high alarm                 :  Off
    Laser rx power low alarm                  :  Off
    Laser rx power high warning               :  Off
    Laser rx power low warning                :  Off
    Laser rx power high alarm threshold       :  1.0000 mW / 0.00 dBm
    Laser rx power low alarm threshold        :  0.0100 mW / -20.00 dBm
    Laser rx power high warning threshold     :  0.7943 mW / -1.00 dBm
    Laser rx power low warning threshold      :  0.0126 mW / -19.00 dBm
Oct 2 2019, 11:43 PM · netops, Operations, ops-eqiad
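
The optics readings above show each power level twice, in mW and in dBm. A minimal Python sketch of that conversion, using the standard definition of dBm (decibels relative to 1 mW); the function names are hypothetical and only the figures printed in the output above are used:

```python
import math


def mw_to_dbm(milliwatts: float) -> float:
    """Convert optical power from milliwatts to dBm (dB relative to 1 mW)."""
    if milliwatts <= 0:
        raise ValueError("power must be positive to express it in dBm")
    return 10 * math.log10(milliwatts)


def dbm_to_mw(dbm: float) -> float:
    """Convert optical power from dBm back to milliwatts."""
    return 10 ** (dbm / 10)


# Values copied from the xe-7/0/46 diagnostics output.
rx_dbm = mw_to_dbm(0.0741)           # receiver signal average optical power
low_warning_dbm = mw_to_dbm(0.0126)  # rx power low warning threshold

# Headroom between the measured level and the low warning threshold.
margin_db = rx_dbm - low_warning_dbm

print(f"rx power:    {rx_dbm:.2f} dBm")
print(f"low warning: {low_warning_dbm:.2f} dBm")
print(f"margin:      {margin_db:.2f} dB")
```

A positive margin means the received signal sits above the low warning threshold, which is consistent with all the alarm and warning flags reading "Off".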
ayounsi added a comment to T222424: configure BGP route damping on IX sessions.

eqord:

Suppressed due to damping:    4
Suppressed due to damping:    4
Suppressed due to damping:    1
Suppressed due to damping:    1

eqdfw:

Suppressed due to damping:    1
Suppressed due to damping:    1
Suppressed due to damping:    1
Oct 2 2019, 6:35 PM · Operations, netops
ayounsi added a comment to T222424: configure BGP route damping on IX sessions.

For the record:

cr4-ulsfo> show bgp neighbor | match "Suppressed due to damping"| except "    0"                      
    Suppressed due to damping:    1
    Suppressed due to damping:    1
    Suppressed due to damping:    27
    Suppressed due to damping:    1
    Suppressed due to damping:    1
    Suppressed due to damping:    2
    Suppressed due to damping:    2
    Suppressed due to damping:    3
    Suppressed due to damping:    1
    Suppressed due to damping:    1

This is out of ~120 BGP sessions; the peer with 27 suppressed routes advertises ~50000 prefixes.

Oct 2 2019, 6:22 PM · Operations, netops
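
The per-session counts in the cr4-ulsfo output above can be tallied with a short script. This is a sketch only: it assumes the CLI output has been saved as text, the names SAMPLE and suppressed_counts are hypothetical, and the ~50000 prefix figure is the approximate one quoted in the comment.

```python
import re

# Output lines copied from the comment (sessions with a count of 0 were
# already filtered out on the router).
SAMPLE = """\
    Suppressed due to damping:    1
    Suppressed due to damping:    1
    Suppressed due to damping:    27
    Suppressed due to damping:    1
    Suppressed due to damping:    1
    Suppressed due to damping:    2
    Suppressed due to damping:    2
    Suppressed due to damping:    3
    Suppressed due to damping:    1
    Suppressed due to damping:    1
"""


def suppressed_counts(text: str) -> list[int]:
    """Return the suppressed-route count from each matching line, in order."""
    return [int(n) for n in re.findall(r"Suppressed due to damping:\s+(\d+)", text)]


counts = suppressed_counts(SAMPLE)
sessions_affected = len(counts)
total_suppressed = sum(counts)

# Share of the largest peer's routes that are suppressed, using the
# approximate prefix count from the comment.
largest_share_pct = max(counts) / 50000 * 100

print(f"sessions with suppressed routes: {sessions_affected}")
print(f"total suppressed routes:         {total_suppressed}")
print(f"largest peer share suppressed:   {largest_share_pct:.3f}%")
```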
ayounsi added a comment to T222424: configure BGP route damping on IX sessions.

Updated change with the above feedback:

[edit protocols bgp group IX4]
+    damping;
[edit protocols bgp group IX6]
+    damping;
[edit policy-options policy-statement BGP_IXP_in]
     term rpki-invalids { ... }
+    /* T222424 */
+    term damping {
+        then damping default;
+    }
[edit policy-options]
+   /* T222424 */
+   damping default {
+       half-life 15;
+       reuse 2000;
+       suppress 6000;
+       max-suppress 60;
+   }
Oct 2 2019, 6:14 PM · Operations, netops
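
To illustrate how the four damping parameters in the change above relate to each other, here is a minimal Python sketch. It assumes the conventional exponential-decay model of route flap damping (as described in RFC 2439), in which a route's penalty halves every half-life; the penalty added per flap is not stated in the task and is deliberately not modelled. All function names are hypothetical, and this is not a description of the Junos implementation.

```python
import math

# Parameters from the "damping default" profile in the change above.
HALF_LIFE_MIN = 15      # half-life, in minutes
REUSE = 2000            # penalty below which a suppressed route is reused
SUPPRESS = 6000         # penalty above which a route is suppressed
MAX_SUPPRESS_MIN = 60   # maximum suppression time, in minutes


def decayed_penalty(penalty: float, minutes: float,
                    half_life: float = HALF_LIFE_MIN) -> float:
    """Penalty remaining after `minutes` of exponential decay."""
    return penalty * 0.5 ** (minutes / half_life)


def minutes_until_reuse(penalty: float, reuse: float = REUSE,
                        half_life: float = HALF_LIFE_MIN) -> float:
    """Minutes of stability needed for `penalty` to decay down to `reuse`."""
    if penalty <= reuse:
        return 0.0
    return half_life * math.log2(penalty / reuse)


def penalty_ceiling(reuse: float = REUSE, half_life: float = HALF_LIFE_MIN,
                    max_suppress: float = MAX_SUPPRESS_MIN) -> float:
    """Largest penalty that still decays to `reuse` within `max_suppress`."""
    return reuse * 2 ** (max_suppress / half_life)


# A route that has just crossed the suppress threshold.
print(f"from suppress to reuse: {minutes_until_reuse(SUPPRESS):.1f} min")

# Under this model the penalty is effectively capped so that suppression
# never outlasts max-suppress.
ceiling = penalty_ceiling()
print(f"penalty ceiling:        {ceiling:.0f}")
print(f"from ceiling to reuse:  {minutes_until_reuse(ceiling):.1f} min")
```

Under these assumptions, a route that only just reaches the suppress threshold becomes usable again after roughly 24 minutes without further flaps, while a persistently flapping route is held for at most the 60 minutes set by max-suppress.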