ayounsi (Arzhel Younsi)
Network Engineer

User Details

User Since
Apr 3 2017, 6:23 PM (132 w, 4 h)
Availability
Available
IRC Nick
xionox
LDAP User
Ayounsi
MediaWiki User
AYounsi (WMF)

Recent Activity

Fri, Oct 11

ayounsi added a comment to T226778: Install new PDUs in rows A/B (Top level tracking task).

Can I suggest a few modifications to the PDU swap checklist of each task? Mostly to clear out the alerting noise
Under: "schedule downtime for the entire list of switches and servers"
Add:
[] Downtime PDUs in Icinga for the time of the maintenance + time for the new one to get re-configured
I know this can be controversial as people use Icinga in different ways, but I believe this is best practice.

Fri, Oct 11, 8:13 AM · DC-Ops, Operations, ops-eqiad

Thu, Oct 10

ayounsi added a project to T235162: Restrict GIDs for system users to 499 as the upper boundary: Operations.
Thu, Oct 10, 10:14 AM · Patch-For-Review, Operations
ayounsi added a comment to T232007: Restbase: significant increase of outbound dropped packets.

Back to normal on 10-07
https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=21&fullscreen&edit&tab=alert&orgId=1&from=1570407765208&to=1570486140039

Thu, Oct 10, 8:14 AM · service-runner, RESTBase, User-mobrovac, Core Platform Team Workboards (Clinic Duty Team)

Wed, Oct 9

ayounsi added a comment to T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC).

It was a PDU misconfiguration and a monitoring issue. It was solved in https://phabricator.wikimedia.org/T229328

Wed, Oct 9, 4:33 PM · DC-Ops, Operations, ops-eqiad

Tue, Oct 8

ayounsi closed T232617: BGP sessions down on cr2-esams as Resolved.

Seems like they had 4 sessions in total.

Tue, Oct 8, 3:31 PM · Operations, netops

Mon, Oct 7

ayounsi committed rOHPUf10738beb2b8: Add kerberos hosts to analytics-in4 + add kerberos to analytics-in6 (authored by ayounsi).
Add kerberos hosts to analytics-in4 + add kerberos to analytics-in6
Mon, Oct 7, 8:57 PM
ayounsi committed rOHPU22d3734239d6: Add BGP prefix damping to IX policies (authored by ayounsi).
Add BGP prefix damping to IX policies
Mon, Oct 7, 8:57 PM
ayounsi committed rOSHO4ffe0e1f0c99: Add commit action to the Homer class (authored by Volans).
Add commit action to the Homer class
Mon, Oct 7, 5:41 PM
ayounsi closed T222424: configure BGP route damping on IX sessions as Resolved.

All done!

Mon, Oct 7, 5:29 PM · Operations, netops
ayounsi added a project to T234831: Massmessage only arriving on Flow-user talk pages: Operations.
Mon, Oct 7, 3:56 PM · MassMessage

Thu, Oct 3

ayounsi committed rOHMP0a359bb6fbb2: README, common and asw2-a/b/c-eqiad mock private data (authored by ayounsi).
README, common and asw2-a/b/c-eqiad mock private data
Thu, Oct 3, 6:13 PM

Wed, Oct 2

ayounsi closed T234416: asw2-a-eqiad <-> cr2-eqiad fiber issue as Resolved.
ayounsi@asw2-a-eqiad> show interfaces diagnostics optics xe-7/0/46 | match "rx|receive" 
    Receiver signal average optical power     :  0.0741 mW / -11.30 dBm
    Laser rx power high alarm                 :  Off
    Laser rx power low alarm                  :  Off
    Laser rx power high warning               :  Off
    Laser rx power low warning                :  Off
    Laser rx power high alarm threshold       :  1.0000 mW / 0.00 dBm
    Laser rx power low alarm threshold        :  0.0100 mW / -20.00 dBm
    Laser rx power high warning threshold     :  0.7943 mW / -1.00 dBm
    Laser rx power low warning threshold      :  0.0126 mW / -19.00 dBm
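
The mW and dBm figures in the output above are two views of the same reading. A minimal sketch of the conversion, using the values shown for xe-7/0/46 (the function names are illustrative, not part of any Juniper tooling):

```python
import math


def mw_to_dbm(milliwatts: float) -> float:
    """Convert optical power in milliwatts to dBm (decibels relative to 1 mW)."""
    return 10 * math.log10(milliwatts)


def dbm_to_mw(dbm: float) -> float:
    """Convert optical power in dBm back to milliwatts."""
    return 10 ** (dbm / 10)


# Values copied from the diagnostics output above.
rx_mw = 0.0741
low_warning_dbm = -19.00

rx_dbm = mw_to_dbm(rx_mw)
margin_db = rx_dbm - low_warning_dbm
print(f"rx power: {rx_dbm:.2f} dBm, {margin_db:.2f} dB above the low warning threshold")
```

With these readings the receive power sits several dB above the low warning threshold, which is consistent with all alarm and warning flags being Off.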
Wed, Oct 2, 11:43 PM · netops, ops-eqiad, Operations
ayounsi added a comment to T222424: configure BGP route damping on IX sessions.

eqord:

Suppressed due to damping:    4
Suppressed due to damping:    4
Suppressed due to damping:    1
Suppressed due to damping:    1

eqdfw:

Suppressed due to damping:    1
Suppressed due to damping:    1
Suppressed due to damping:    1
Wed, Oct 2, 6:35 PM · Operations, netops
ayounsi added a comment to T222424: configure BGP route damping on IX sessions.

For the record:

cr4-ulsfo> show bgp neighbor | match "Suppressed due to damping"| except "    0"                      
    Suppressed due to damping:    1
    Suppressed due to damping:    1
    Suppressed due to damping:    27
    Suppressed due to damping:    1
    Suppressed due to damping:    1
    Suppressed due to damping:    2
    Suppressed due to damping:    2
    Suppressed due to damping:    3
    Suppressed due to damping:    1
    Suppressed due to damping:    1

This is out of ~120 BGP sessions, the 27 is out of ~50000 prefixes advertised by this peer.

Wed, Oct 2, 6:22 PM · Operations, netops
ayounsi added a comment to T222424: configure BGP route damping on IX sessions.

Updated change with the above feedback:

[edit protocols bgp group IX4]
+    damping;
[edit protocols bgp group IX6]
+    damping;
[edit policy-options policy-statement BGP_IXP_in]
     term rpki-invalids { ... }
+    /* T222424 */
+    term damping {
+        then damping default;
+    }
[edit policy-options]
+   /* T222424 */
+   damping default {
+       half-life 15;
+       reuse 2000;
+       suppress 6000;
+       max-suppress 60;
+   }
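
To get a feel for what these values mean in practice, here is a rough sketch of the usual exponential-decay model of route flap damping (as described in RFC 2439). It assumes the penalty halves every half-life and that the penalty ceiling is reuse * 2^(max-suppress / half-life). How much penalty a router adds per flap is not covered here, so penalties are plain inputs and the helper names are hypothetical.

```python
import math

# Values from the damping profile above (times in minutes).
HALF_LIFE_MIN = 15
REUSE = 2000
SUPPRESS = 6000
MAX_SUPPRESS_MIN = 60


def penalty_after(initial: float, minutes: float, half_life: float = HALF_LIFE_MIN) -> float:
    """Penalty remaining after `minutes` of stability (halves every half-life)."""
    return initial * 0.5 ** (minutes / half_life)


def minutes_until_reuse(penalty: float, reuse: float = REUSE,
                        half_life: float = HALF_LIFE_MIN) -> float:
    """Minutes of stability needed for `penalty` to decay down to `reuse`."""
    if penalty <= reuse:
        return 0.0
    return half_life * math.log2(penalty / reuse)


# Assumed ceiling: the highest penalty that still decays to `reuse`
# within max-suppress minutes.
ceiling = REUSE * 2 ** (MAX_SUPPRESS_MIN / HALF_LIFE_MIN)

print(f"penalty ceiling: {ceiling:.0f}")
print(f"from suppress to reuse: {minutes_until_reuse(SUPPRESS):.1f} min")
print(f"from ceiling to reuse: {minutes_until_reuse(ceiling):.1f} min")
```

Under these assumptions a prefix that has just crossed the suppress threshold needs a bit under 24 minutes of stability before it is reused, and no prefix stays suppressed longer than the 60 minute max-suppress.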
Wed, Oct 2, 6:14 PM · Operations, netops
ayounsi closed T234335: Telia IC-314534 (eqord/eqdfw 10Gbps wave) down as Resolved.

Work completed, everything is up, thanks to you two!

Wed, Oct 2, 3:53 PM · Operations, netops
ayounsi assigned T234416: asw2-a-eqiad <-> cr2-eqiad fiber issue to Cmjohnson.

Related to T203719.

Wed, Oct 2, 3:38 PM · netops, ops-eqiad, Operations
ayounsi assigned T234411: msw-c1 down? to Papaul.

@Papaul, can you check the LED status, cables (all properly connected), then power cycle the device?

Wed, Oct 2, 3:26 PM · netops, ops-codfw, Operations

Tue, Oct 1

ayounsi closed T233645: asw2-d2-eqiad crash as Resolved.

Discussed during the Monday meeting, will leave it as is.

Tue, Oct 1, 3:27 PM · Wikimedia-Incident, Operations, netops

Mon, Sep 30

ayounsi added a project to T142862: Setup Kubernetes Masters in a HA setup: Wikimedia-Incident.
Mon, Sep 30, 9:55 PM · Wikimedia-Incident, Kubernetes, Toolforge, Tools-Kubernetes
ayounsi added a project to T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail: Wikimedia-Incident.
Mon, Sep 30, 9:54 PM · Wikimedia-Incident, cloud-services-team (Kanban), Toolforge
ayounsi added a comment to T222424: configure BGP route damping on IX sessions.

Great doc, thanks!
We can use 2000 for reuse, the following will happen:

Mon, Sep 30, 7:59 PM · Operations, netops
akosiaris awarded T222424: configure BGP route damping on IX sessions a Love token.
Mon, Sep 30, 3:08 PM · Operations, netops

Fri, Sep 27

ayounsi added a project to T233662: Logstash pipeline crashes on non-UTF8 log messages.: Wikimedia-Incident.
Fri, Sep 27, 9:15 PM · Wikimedia-Incident, Patch-For-Review, Wikimedia-Logstash, Operations
ayounsi removed a project from T234047: Extend firewall rules for new corp LDAP replicas: netops.
Fri, Sep 27, 6:30 PM · Operations
ayounsi updated subscribers of T222424: configure BGP route damping on IX sessions.

Maybe @jbond too!

Fri, Sep 27, 6:29 PM · Operations, netops
ayounsi added a comment to T234047: Extend firewall rules for new corp LDAP replicas.

There is only a mention of dubnium.wikimedia.org (208.80.154.13) in the analytics firewall filter.
If that task is for network devices only, feel free to close it. If it's for all types of firewalls (e.g. ferm), please re-assign it.

Fri, Sep 27, 4:47 PM · Operations
ayounsi added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

Another one (scheduled as 17144179)
2019-09-26 23:32:28 xe-4/1/3 ifOperStatus: down -> up
2019-09-26 22:12:28 xe-4/1/3 ifOperStatus: up -> down

Fri, Sep 27, 4:40 PM · Operations, netops
ayounsi reopened T205712: (OoW) wtp2020: correctable memory errors as "Open".

This is alerting again: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wtp2020&service=Memory+correctable+errors+-EDAC-

Fri, Sep 27, 4:33 PM · Operations, ops-codfw

Thu, Sep 26

ayounsi added a comment to T213843: Juniper network device audit - all sites.

I wrote a Netbox report to check against Juniper's installed base ( https://netbox.wikimedia.org/extras/reports/juniper.Juniper/ )
Still in review in https://gerrit.wikimedia.org/r/c/operations/software/netbox-reports/+/539192

Thu, Sep 26, 7:06 PM · DC-Ops, netops, Operations
ayounsi committed rLPRIa2b1b0c30f40: Add fake deploy homer ssh keys (authored by ayounsi).
Add fake deploy homer ssh keys
Thu, Sep 26, 4:26 PM

Wed, Sep 25

ayounsi removed a project from T233318: scs monitoring missing in Icinga: netops.
Wed, Sep 25, 6:48 PM · Icinga, observability, Operations
ayounsi added a comment to T232602: GRE MTU mitigations - Tracking.

@BBlack @faidon let me know when is a good time to remove that MSS hack on the routers.
To be done one router at a time with time in between for the sessions to re-establish. Will also drain NTT/Telia using BGP graceful shutdown beforehand.

Wed, Sep 25, 6:47 PM · Traffic, Operations
ayounsi committed rLPRI69bdeb4dffd7: Add fake SSH keypair for user homer (authored by ayounsi).
Add fake SSH keypair for user homer
Wed, Sep 25, 6:12 PM

Tue, Sep 24

ayounsi closed T83119: Netflow Collector Project as Resolved.

https://wikitech.wikimedia.org/wiki/Netflow

Tue, Sep 24, 10:12 PM · Operations, netops, observability
ayounsi added a comment to T233645: asw2-d2-eqiad crash.

The logs rolled over the weekend...

Tue, Sep 24, 9:54 PM · Wikimedia-Incident, Operations, netops
ayounsi added a comment to T211728: Outbound BGP graceful shutdown.

FYI I tested the policy from the description successfully during T226422 and T226424.

Tue, Sep 24, 9:40 PM · Operations, netops
ayounsi claimed T222424: configure BGP route damping on IX sessions.

Keeping the default damping settings (per the doc), here is what I think we should push to our routers:

[edit protocols bgp group IX4]
+    damping;
[edit protocols bgp group IX6]
+    damping;
[edit policy-options policy-statement BGP_IXP_in]
     term rpki-invalids { ... }
+    /* T222424 */
+    term damping {
+        then damping default;
+    }
[edit policy-options]
+   /* T222424 */
+   damping default {
+       half-life 15;
+       reuse 750;
+       suppress 3000;
+       max-suppress 60;
+   }

To Private peers as well.

Tue, Sep 24, 7:37 PM · Operations, netops
ayounsi renamed T222424: configure BGP route damping on IX sessions from cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) to configure BGP route damping on IX sessions.
Tue, Sep 24, 7:29 PM · Operations, netops
ayounsi closed T189689: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org as Resolved.

No new updates since March 2018, feel free to reopen if the issue is still there.

Tue, Sep 24, 6:15 PM · netops, Operations
ayounsi edited projects for T201444: Refresh switch ports descriptions for recently renamed cloud servers, added: ops-eqiad; removed netops.
Tue, Sep 24, 6:09 PM · ops-eqiad, Operations, cloud-services-team, DC-Ops
ayounsi removed a project from T212878: Netbox racks consistency report: netops.
Tue, Sep 24, 6:01 PM · netbox, Operations
ayounsi removed a project from T220700: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible): netops.
Tue, Sep 24, 6:00 PM · ops-eqiad, hardware-requests, Operations, Analytics, User-Elukey
ayounsi closed T230005: BGP session down for AS4739 on cr4-ulsfo as Resolved.

Sessions are now established. Thanks!

Tue, Sep 24, 5:54 PM · netops, Operations
ayounsi closed T211254: Free up 185.15.59.0/24, a subtask of T207753: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22, as Resolved.
Tue, Sep 24, 5:44 PM · Operations, netops
ayounsi closed T211254: Free up 185.15.59.0/24 as Resolved.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/509140 was the last thing to do and it has been merged some time ago.

Tue, Sep 24, 5:44 PM · Patch-For-Review, Traffic, Operations, netops
ayounsi added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.

Confirmed working as expected.
I manually checked the "null" country's IPs and they match "Anonymous proxies".

Tue, Sep 24, 3:14 PM · Analytics-Kanban, Analytics
ayounsi closed T233672: possible routing issue between eqiad and Maxmind network as Resolved.

Resolved by Cloudflare.

Tue, Sep 24, 2:47 AM · Operations, fundraising-tech-ops, netops
ayounsi claimed T233672: possible routing issue between eqiad and Maxmind network.

All those IPs are behind Cloudflare. Opened a ticket with them.

Tue, Sep 24, 2:03 AM · Operations, fundraising-tech-ops, netops
ayounsi updated the task description for T233672: possible routing issue between eqiad and Maxmind network.
Tue, Sep 24, 2:03 AM · Operations, fundraising-tech-ops, netops

Mon, Sep 23

ayounsi added a project to T233645: asw2-d2-eqiad crash: Wikimedia-Incident.
Mon, Sep 23, 6:45 PM · Wikimedia-Incident, Operations, netops
ayounsi updated the task description for T233645: asw2-d2-eqiad crash.
Mon, Sep 23, 6:10 PM · Wikimedia-Incident, Operations, netops
ayounsi triaged T233645: asw2-d2-eqiad crash as High priority.
Mon, Sep 23, 6:07 PM · Wikimedia-Incident, Operations, netops
ayounsi added a comment to T232412: HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed.

Getting the WordPress replies through a separate thread.

Mon, Sep 23, 3:41 PM · Wikimedia-Blog

Thu, Sep 19

ayounsi added a comment to T226782: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC).

This is alerting: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-a1-eqiad

Thu, Sep 19, 9:40 PM · DC-Ops, Operations, ops-eqiad
ayounsi added a comment to T232007: Restbase: significant increase of outbound dropped packets.

And it's alerting again...
https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=21&fullscreen&edit&tab=alert&orgId=1&from=1568895095253&to=1568912301774

Thu, Sep 19, 9:36 PM · service-runner, RESTBase, User-mobrovac, Core Platform Team Workboards (Clinic Duty Team)
ayounsi closed Restricted Task, a subtask of T222109: decommission frav1001.frack.eqiad.wmnet, as Resolved.
Thu, Sep 19, 8:17 PM · decommission, Operations, fundraising-tech-ops, ops-eqiad, DC-Ops
ayounsi closed Restricted Task, a subtask of T232029: synchronize frmon1001:/var/lib/grafana to frmon2001:/var/lib/grafana, as Resolved.
Thu, Sep 19, 8:17 PM · fundraising-tech-ops
ayounsi closed Restricted Task, a subtask of T233328: set up cross-host backups between frmon1001 and frmon2001, as Resolved.
Thu, Sep 19, 8:17 PM · fundraising-tech-ops
ayounsi added a comment to T232602: GRE MTU mitigations - Tracking.
  • Setting tcp-mss on an interface causes all the BGP sessions going over that interface to bounce
  • As eqiad and codfw exchange a full view, some outbound eqiad traffic goes through codfw so we should clamp codfw/codfw too, but as it's very little traffic it might not be worth it. We might want to isolate eqiad/codfw more too later on.
Thu, Sep 19, 6:30 PM · Traffic, Operations
ayounsi added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/531752 has been merged after talking to Luca, confirmed using kafkacat that the data is there.

Thu, Sep 19, 5:32 PM · Analytics-Kanban, Analytics
ayounsi triaged T233336: Add urlshortener to Turnilo as Low priority.
Thu, Sep 19, 5:11 PM · Analytics
ayounsi triaged T233318: scs monitoring missing in Icinga as Normal priority.
Thu, Sep 19, 3:14 PM · Icinga, observability, Operations

Wed, Sep 18

ayounsi added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.

LGTM! Thanks!

Wed, Sep 18, 11:34 PM · Analytics-Kanban, Analytics
ayounsi closed T196432: Configure interface damping on primary links as Resolved.

All primary links of all transport pairs now have damping configured.

Wed, Sep 18, 9:16 PM · Wikimedia-Incident, Operations, Traffic, netops
ayounsi closed T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams as Resolved.

From Level3:

I appreciate your patience while we worked on gathering the data on these repair tickets. I’ve attached the repair ticket log above for you.
Unfortunately, there was no chronic issue when researching this circuit. The circuit has been impacted by multiple planned maintenances and higher-level network events. They have all been different troubles that have been restored when the problem was brought to our attention.
We understand the negative impact this has on your business and the need to mitigate the down time. With the circuit being down so many times we would expect to see some type of chronic issue, but this was really an outlier with so many issues that impacted the service. Let me know if you have any questions or if I can provide further assistance.

Wed, Sep 18, 8:34 PM · Operations, netops
ayounsi reopened T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC), a subtask of T226778: Install new PDUs in rows A/B (Top level tracking task), as Open.
Wed, Sep 18, 6:55 PM · DC-Ops, Operations, ops-eqiad
ayounsi reopened T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) as "Open".

The Netbox/LibreNMS check is not happy: https://netbox.wikimedia.org/extras/reports/librenms.LibreNMS/
Did Netbox get updated with the new serial?

Wed, Sep 18, 6:55 PM · DC-Ops, Operations, ops-eqiad
ayounsi triaged T233248: Power issue in eqiad A1 as High priority.
Wed, Sep 18, 6:32 PM · Operations, ops-eqiad
ayounsi closed T220639: Show IPs matching a list of IP subnets in Webrequest data as Resolved.

All good here. Thanks!

Wed, Sep 18, 4:01 PM · User-Elukey, Analytics
ayounsi added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.

Adding the following to the ones you listed should cover most of the cases:

{'49': 'ACK+URG+FIN', '48': 'ACK+URG', '40': 'PSH+URG', '36': 'RST+URG', '34': 'SYN+URG', '33': 'FIN+URG', '29': 'PSH+ACK+RST+FIN', '28': 'PSH+ACK+RST', '26': 'PSH+ACK+SYN', '25': 'PSH+ACK+FIN', '24': 'PSH+ACK', '22': 'RST+ACK+SYN', '21': 'RST+ACK+FIN', '20': 'RST+ACK', '19': 'SYN+ACK+FIN', '18': 'SYN+ACK', '17': 'FIN+ACK', '12': 'RST+PSH', '10': 'SYN+PSH', '9': 'FIN+PSH', '6': 'SYN+RST', '5': 'FIN+RST', '3': 'FIN+SYN'}
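
The mapping above follows from the TCP flags field being a bitmask, so an alternative to listing combinations is to decode any value bit by bit. A small sketch, assuming tcp_flags holds the low six standard flag bits as an integer (the function name is made up for illustration, and the output orders names from the lowest bit up, so '49' comes out as FIN+ACK+URG rather than ACK+URG+FIN):

```python
# Standard TCP flag bit values, lowest bit first.
TCP_FLAG_BITS = [
    (1, "FIN"),
    (2, "SYN"),
    (4, "RST"),
    (8, "PSH"),
    (16, "ACK"),
    (32, "URG"),
]


def decode_tcp_flags(value: int) -> str:
    """Return a '+'-joined list of the flag names set in `value`."""
    names = [name for bit, name in TCP_FLAG_BITS if value & bit]
    return "+".join(names) if names else "NONE"


for raw in ("18", "24", "49"):
    print(raw, "->", decode_tcp_flags(int(raw)))
```

This covers every combination of those six flags, including ones missing from the list, at the cost of a fixed name order.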

Wed, Sep 18, 2:16 AM · Analytics-Kanban, Analytics
ayounsi added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.

Thanks for looking into that!

Wed, Sep 18, 12:09 AM · Analytics-Kanban, Analytics

Tue, Sep 17

ayounsi closed T233075: Review firewall rules for labpuppetmaster1001/labpuppetmaster1002 removal as Resolved.

No mention of those two hosts (or their IPs) in Rancid (network devices).

Tue, Sep 17, 10:08 PM · Operations, netops
ayounsi added a comment to T233129: update puppet for new PDU models.

Nah, Sentry Smart PDU Version 8.0n are still Sentry 4.
I think the gap is at v7 = Sentry 3, v8 = Sentry 4

Tue, Sep 17, 5:11 PM · DC-Ops, Operations, ops-eqiad
ayounsi added a comment to T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring.

Note that the data is in LibreNMS as well, but with some limitations:

  • 5min granularity
  • Not possible to stack or sum graphs (each power graph is independent)
Tue, Sep 17, 3:09 PM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, observability, Operations
ayounsi created P9118 power_password.py.
Tue, Sep 17, 1:56 PM

Mon, Sep 16

ayounsi closed T232617: BGP sessions down on cr2-esams as Resolved.

@jbond solved AS28598

Mon, Sep 16, 8:56 PM · Operations, netops
ayounsi added a comment to T232617: BGP sessions down on cr2-esams.

From AMS-IX ML:

Please remove your BGP sessions to the following IP’s:
AS12871
IPv4: 80.249.208.64
IPv6: 2001:7f8:1::a501:2871:1

Mon, Sep 16, 8:54 PM · Operations, netops
ayounsi added projects to T233047: Apache mod_status aggregator: Operations, Core Platform Team.
Mon, Sep 16, 8:06 PM · observability, Operations
ayounsi closed T232977: librenms doesn't print alert text on irc anymore as Resolved.

Confirmed fixed!

Mon, Sep 16, 8:01 PM · Operations, netops
ayounsi added a comment to T232977: librenms doesn't print alert text on irc anymore.

Pushed https://gerrit.wikimedia.org/r/c/operations/puppet/+/537182 to fix some email alerting issues, not sure yet if it helps with the IRC ones.

Mon, Sep 16, 7:29 PM · Operations, netops

Sep 13 2019

ayounsi added a comment to T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC).

Trying to figure out why this is failing: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-b6-eqiad
error is:

External command error: Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
Failed object: iso.3.6.1.4.1.1718.3.2.2.1.7.1.1

Sep 13 2019, 8:12 PM · DC-Ops, Operations, ops-eqiad
ayounsi added a comment to T232602: GRE MTU mitigations - Tracking.

As discussed on IRC, this *should* work for inbound (clamping the SYNACK too), but to be tested.

Sep 13 2019, 4:18 PM · Operations, Traffic
ayounsi added a comment to T232602: GRE MTU mitigations - Tracking.

Note that with new eqiad routing engines we can set the MSS at the router level (untested).
Advantages are: easier to deploy (one configuration change) and can be applied to external flows only, not all flows in/out of a server.
All DCs except esams should support it for now. (esams after the refresh).
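
For reference, a back-of-the-envelope sketch of how a clamped MSS value can be derived. All numbers here are assumptions for illustration and not values taken from this task: a 1500-byte path MTU, GRE over IPv4 adding 24 bytes (20-byte outer IPv4 header plus 4-byte base GRE header), and TCP/IP headers without options.

```python
# Header sizes in bytes, without options or extension headers.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20

# Assumed tunnel overhead: outer IPv4 header + base GRE header.
GRE_OVER_IPV4_OVERHEAD = 20 + 4


def clamped_mss(path_mtu: int, tunnel_overhead: int, ip_header: int) -> int:
    """Largest TCP payload that fits in one packet once tunnel overhead is added."""
    return path_mtu - tunnel_overhead - ip_header - TCP_HEADER


print("IPv4 MSS:", clamped_mss(1500, GRE_OVER_IPV4_OVERHEAD, IPV4_HEADER))
print("IPv6 MSS:", clamped_mss(1500, GRE_OVER_IPV4_OVERHEAD, IPV6_HEADER))
```

The actual value to configure depends on the real encapsulation and path MTU in use, so treat these figures as placeholders.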

Sep 13 2019, 3:06 PM · Operations, Traffic

Sep 12 2019

ayounsi added a comment to T232711: Deploy ripe-atlas-tools for ad-hoc network tests.

LGTM!

Sep 12 2019, 7:11 PM · Operations, netops, observability
ayounsi closed T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad, a subtask of T196432: Configure interface damping on primary links, as Resolved.
Sep 12 2019, 6:14 PM · Wikimedia-Incident, Operations, Traffic, netops
ayounsi closed T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad as Resolved.

Alright, everything here is done, and it was quite smooth.
Some notes:

  • k8s1005 and k8s1006 only had v4/v6 sessions to cr1 and not cr2, which caused this page

PROBLEM - LVS HTTP IPv4 #page on sessionstore.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds

Was fixed quickly and the service is not in prod yet

  • VRRP failover triggered the following for eqiad/codfw/ulsfo. Not ideal, but not critical either

PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo

  • text LVS and its backup (1016) LVS are on the same cr1, needs to have T180069 in prod urgently as it's a SPOF
  • CF tunnel failover worked as expected
Sep 12 2019, 6:14 PM · Operations, netops, ops-eqiad
ayounsi updated the task description for T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad.
Sep 12 2019, 3:10 PM · Operations, netops, ops-eqiad
ayounsi updated the task description for T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad.
Sep 12 2019, 2:30 PM · Operations, netops, ops-eqiad

Sep 11 2019

ayounsi updated subscribers of T232007: Restbase: significant increase of outbound dropped packets.

This is back to much better values as of 16:50 UTC for codfw and 18:40 for eqiad. Still higher than before the main increase though.
From SAL it matches:
16:39 urandom: decommissioning Cassandra, restbase1018-a -- T224553
18:43 urandom: decommissioning Cassandra, restbase1018-b -- T224553
/cc @Eevans

Sep 11 2019, 11:19 PM · service-runner, RESTBase, User-mobrovac, Core Platform Team Workboards (Clinic Duty Team)
ayounsi added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.

I just thought I can easily set up Turnilo to decode tcp_flags so they are not ints, let me give it a try.

Sep 11 2019, 10:22 PM · Analytics-Kanban, Analytics
ayounsi added a comment to T232412: HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed.
  • "When exactly did those errors start appearing?"

2019-09-07T03:43:55 UTC

  • "How many occurrences of these have you seen, and when?"

About one flap every 5 minutes.

  • "Is the problem still ongoing?"

Yes

  • "Can you please provide a traceroute, mtr, or some form of connection trace from your side. At the very least the source network IP and timestamp."
ayounsi@icinga1001:~$ mtr -z -n --report-wide blog.wikimedia.org
Start: Wed Sep 11 15:10:33 2019
HOST: icinga1001               Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS14907  208.80.154.67     0.0%    10    0.7   0.5   0.4   0.8   0.0
  2. AS???    206.126.237.124   0.0%    10    1.5   0.9   0.5   1.7   0.0
  3. AS2635   198.181.119.95    0.0%    10    0.8   0.7   0.5   1.1   0.0
  4. AS???    100.68.10.7       0.0%    10    0.7   0.7   0.5   0.9   0.0
  5. AS2635   192.0.79.33       0.0%    10    0.5   0.5   0.5   0.6   0.0

Source is 208.80.154.84

Sep 11 2019, 3:12 PM · Wikimedia-Blog
ayounsi added a comment to T232617: BGP sessions down on cr2-esams.

I think it's safe to delete 28598 if they don't reply to your most recent email.

Sep 11 2019, 3:01 PM · Operations, netops
ayounsi added a comment to T230005: BGP session down for AS4739 on cr4-ulsfo.

Yep! I can walk you through it if needed.

Sep 11 2019, 2:59 PM · netops, Operations

Sep 10 2019

ayounsi added a comment to T232491: Numerous people reporting issues saving edits and viewing previews/diffs.

Thanks for the reports, we have narrowed down the cause to a MTU issue on our side.

Sep 10 2019, 10:38 PM · netops, Traffic, Wikimedia-General-or-Unknown, Operations
ayounsi added a comment to T232226: wmf_netflow cube in Turnilo missing bytes and packets measures.

A quick look shows everything looking fine.

Sep 10 2019, 4:28 PM · Patch-For-Review, Analytics-Kanban, Analytics

Sep 9 2019

ayounsi added a comment to T232349: Disable production EventLogging analytics MySQL consumers.

This looks related:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=eventlog1002&service=Check+systemd+state
CRITICAL - degraded: The system is operational but one or more units failed.

eventlog1002:~$ sudo systemctl
[...]
● eventlogging-consumer@mysql-eventbus.service not-found failed failed    eventl
Sep 9 2019, 11:45 PM · Analytics, Analytics-EventLogging
ayounsi created T232412: HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed.
Sep 9 2019, 10:58 PM · Wikimedia-Blog
ayounsi created P9065 (An Untitled Masterwork).
Sep 9 2019, 2:42 PM

Sep 6 2019

ayounsi added a comment to T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad.

Postponed to Thursday Sept 12th, 8am PST, 11am local time, 15:00 UTC. 3h

Sep 6 2019, 3:11 PM · Operations, netops, ops-eqiad