
an-worker1165.eqiad.wmnet and increased network activity resulting in page on May 13 2024
Closed, Resolved · Public

Assigned To
Authored By
ssingh
Tue, May 14, 5:47 PM

Description

Yesterday, we got a page for (times in UTC):

18:27:46 <+jinxer-wm> FIRING: Primary outbound port utilisation over 80%  #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - 
                      Primary outbound port utilisation over 80%  #page   - 
                      https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page

After some inspection, and thanks to a recommendation by @CDanis, the suggestion was that this was internal traffic related to the Analytics cluster. network_flows_internal takes a while to catch up on which host it was, but it seems like it was an-worker1165.eqiad.wmnet and 10.64.157.4. You can see it in Turnilo at https://w.wiki/A4ym.

@Dzahn noticed that there were kafka-jumbo restarts around the same time, and the alert pointed to port 50010, which shows up in Puppet as the Hadoop DataNode port.

The alert self-resolved, which makes sense as the traffic started going down, but I am filing this task to see if we can make sense of what happened here so that we can prevent it from happening again, if required and if that makes sense.

Thanks!

Event Timeline

To add some context:

The ports that saturated weren't ports for individual machines on the access switches. Those don't page.

They were a few of the connections between the access switches and the core routers.

I think that this level of utilization on such a shared link can potentially cause user-facing issues with live traffic as well.

Thanks for the task and analysis.

it seems like it was an-worker1165.eqiad.wmnet and 10.64.157.4

Certainly that host had a very large amount of traffic sent to it, maxing its 10Gb/sec port with over 800kpps:

image.png (708×1 px, 204 KB)
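(As a rough cross-check, assuming the figures above: 10 Gbit/sec at 800 kpps is about 10e9 / 8 / 800,000 ≈ 1,560 bytes per packet, i.e. close to full-size 1500-byte frames, which is what you would expect for bulk data transfer.)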

In the same timeframe we see a high amount of traffic sent by an-worker1164, another host in rack E7 (and seen in Turnilo here)

image.png (631×1 px, 111 KB)

Those specific hosts, while standing out in Turnilo, may not have been the main cause of the alerts relating to the CR uplinks from rows A and C. But it does seem to have been an-worker/an-presto nodes in general.

@Dzahn noticed that there were kafka-jumbo restarts around the same time, and the alert pointed to port 50010, which shows up in Puppet as the Hadoop DataNode port.

Turnilo suggests that was indeed the port at play. Analysing the top talkers to/from that port over the time period is quite interesting (a rough sketch of that kind of aggregation follows the second table below). The busiest sources and destinations are all in the new rows; for instance, the top response flows from source port 50010 over that time:

Sender         | Sender Rack | Destination    | Destination Rack | GBytes Total
an-worker1164  | E7          | an-worker1165  | E7               | 579
an-worker1163  | E7          | an-worker1165  | E7               | 406
an-worker1162  | E6          | an-worker1160  | E6               | 358
an-worker1161  | E6          | an-worker1160  | E6               | 357
an-worker1172  | F7          | an-worker1174  | F7               | 283
an-worker1173  | F7          | an-worker1174  | F7               | 173
an-worker1169  | F6          | an-worker1170  | F6               | 160
an-worker1142  | E1          | an-worker1147  | E1               | 132
an-worker1153  | E1          | an-worker1147  | E1               | 119
an-worker1169  | F6          | an-worker1171  | F6               | 92

No hosts in the top 100 responders (i.e. traffic from port 50010) are in rows A and C, but I think that is probably just due to the volume being sent within the new rows topping the chart. Looking at requests (traffic to port 50010), we do see some hosts in those rows appearing, although the volume is an order of magnitude lower than within the new rows. Still, it is likely that the responses to the requests below account for the increased usage on the links in the alerts:

Sender         | Sender Rack | Destination    | Destination Rack | GBytes Total
an-worker1143  | E2          | an-worker1082  | A7               | 3
an-worker1167  | F5          | an-worker1089  | C4               | 2
an-worker1144  | F1          | an-worker1120  | A4               | 2
an-worker1161  | E6          | an-worker1119  | A2               | 2
an-worker1153  | E1          | an-worker1079  | A2               | 2
an-worker1174  | F7          | an-worker1141  | A4               | 2
an-worker1143  | E2          | an-worker1102  | A4               | 2
an-worker1167  | F5          | an-worker1078  | A2               | 2
an-worker1143  | E2          | an-worker1100  | C4               | 2
an-worker1169  | F6          | an-worker1133  | C7               | 2
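For reference, the kind of top-talker aggregation shown above can be sketched roughly as follows. This is an illustrative Python sketch only: the flow record fields (src, src_rack, dst, dst_rack, src_port, bytes) are hypothetical and do not reflect the actual netflow/Turnilo schema.

from collections import defaultdict

def top_talkers(flows, port=50010, limit=10):
    # Sum bytes per (sender, destination) pair for flows sourced from `port`.
    totals = defaultdict(int)
    for f in flows:
        if f["src_port"] == port:
            key = (f["src"], f["src_rack"], f["dst"], f["dst_rack"])
            totals[key] += f["bytes"]
    # Sort pairs by total volume, largest first.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]

# Example with a couple of made-up records (values in bytes):
flows = [
    {"src": "an-worker1164", "src_rack": "E7", "dst": "an-worker1165",
     "dst_rack": "E7", "src_port": 50010, "bytes": 579 * 2**30},
    {"src": "an-worker1163", "src_rack": "E7", "dst": "an-worker1165",
     "dst_rack": "E7", "src_port": 50010, "bytes": 406 * 2**30},
]
for (src, src_rack, dst, dst_rack), total in top_talkers(flows):
    print(f"{src} ({src_rack}) -> {dst} ({dst_rack}): {total / 2**30:.0f} GB")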

In terms of what this traffic is or why it kicked off I've no idea; I assume it is due to the kafka-jumbo restarts. In and of itself it's not that problematic, and we don't have other indications of user problems that it caused, but we can't say for sure that we didn't at least have some performance degradation.

I sound like a broken record on QoS, but I am back working on it, so we will hopefully have it in place on the network this quarter. I think this Hadoop / port 50010 traffic would be a good candidate to put into the "scavenger" traffic class, which means it can still go mad like this if needed, but other, more important user traffic will take precedence if there is congestion.
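For illustration, one common way bulk flows end up in a scavenger class is DSCP marking at the sender, with the network then mapping that code point to a low-priority queue; the classification could equally be done on the switches. A minimal Python sketch, assuming CS1 (DSCP 8) as the scavenger code point (this is not the actual QoS design):

import socket

# Assumption: CS1 (DSCP decimal 8) is the scavenger code point.
# The DSCP value occupies the upper six bits of the legacy TOS byte.
CS1_TOS = 8 << 2  # 0x20

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, CS1_TOS)
# Traffic sent on this socket now carries the CS1 marking, which the network
# could map to a low-priority queue, e.g.:
# sock.connect(("an-worker1165.eqiad.wmnet", 50010))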

The other thing that will help is when we begin to upgrade the switches in rows A-D in Eqiad. The new design provides more bandwidth between racks and also (when complete) means inter-rack traffic does not route through the core routers. Looking at the links in rows E/F here for instance there were no drops or even enough usage to make the links very busy.

From my point of view I don't think any other immediate actions are needed, and we can perhaps close this task. That said it might be good if the team(s) who manage these services can comment in terms of if this was expected or how often such events should be expected.

Quick note on what the traffic this task refers to is.
Port 50010 on an-worker nodes is the Hadoop HDFS DataNode data transfer port.
This is the port serving data requested from HDFS that is stored on that worker. Most of the traffic on this port is expected to come either from the same or other an-worker nodes, or from an-presto nodes (both are compute nodes reading HDFS data).
The amount of data read varies a lot depending on jobs. One can see that the amount of data read from HDFS is very spiky: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop&from=now-7d&to=now&viewPanel=111
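As a rough illustration of where that port comes from, the DataNode data transfer address is the dfs.datanode.address property in hdfs-site.xml (50010 was the legacy default; newer Hadoop releases default to 9866). A small Python sketch, assuming a config file at /etc/hadoop/conf/hdfs-site.xml:

import xml.etree.ElementTree as ET

def hdfs_property(path, name):
    # Return the value of a named property from a Hadoop XML config file.
    root = ET.parse(path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None  # not set explicitly, so the compiled-in default applies

print(hdfs_property("/etc/hadoop/conf/hdfs-site.xml", "dfs.datanode.address"))
# e.g. "0.0.0.0:50010" on clusters keeping the legacy port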

I don't know why that specific high-usage moment led to a page while some others have not, though.
Anyhow, +1 on setting up QoS for lower-priority traffic (Hadoop/presto for instance :).

The amount of data read varies a lot depending on jobs. One can see that the amount of data read from HDFS is very spiky: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop&from=now-7d&to=now&viewPanel=111

Thanks, the big surge in that graph matches what we observed on the network.

I don't know why that specific high-usage moment led to a page while some others have not, though.

It's all related to the exact level of usage, the specific hosts involved, and exactly where they are located. If any of our core links reaches 80% utilization for more than a few minutes, it will trigger. So one assumes that the other times we saw high utilization, it didn't quite trip the threshold.
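Purely as an illustration of that trigger condition (not the actual alerting rule; the window length and evaluation here are assumptions), a Python sketch:

def should_page(samples_pct, threshold=80.0, sustained_samples=5):
    # samples_pct: per-minute link utilisation percentages, most recent last.
    # Fire only if the last `sustained_samples` samples all exceed the threshold.
    recent = samples_pct[-sustained_samples:]
    return len(recent) == sustained_samples and all(s > threshold for s in recent)

print(should_page([60, 72, 85, 88, 91, 90, 86]))  # True: last five samples > 80%
print(should_page([60, 85, 88, 79, 91, 90, 86]))  # False: a dip below 80% in the window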

Anyhow, +1 on setting up QoS for lower-priority traffic (Hadoop/presto for instance :).

Great, I will touch base when we have the core set up to do this, it should be simple to enable via puppet.

Gonna close this one. As a last datapoint, if you 'stack' the Hadoop graph in Grafana you can clearly see that the cumulative reads at ~15:55 on May 14th were a good deal higher than any of the other spikes of usage over the past few days (peaking at almost 200Gbit/sec). So it makes sense that this one paged and the others didn't.

image.png (497×1 px, 132 KB)

I think that the most likely candidate at the moment is user-generated load from someone doing a heavy query in Superset, thereby pulling lots of data from HDFS, via the presto workers.

Here is a stacked graph of the presto workers' network throughput at around the same time; we can see a pretty sustained transfer to all fifteen presto workers, all pulling data from Hadoop workers as fast as possible.

image.png (961×1 px, 194 KB)
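For anyone reproducing that kind of aggregate view outside Grafana, a rough sketch against the Prometheus HTTP API is below. The endpoint URL and the instance label matcher are assumptions and would need adjusting to the local setup (e.g. the Thanos query endpoint and the actual host naming):

import requests

PROM_URL = "http://prometheus.example.org/api/v1/query"  # placeholder endpoint
# Aggregate receive throughput (bits/sec) across hosts matching the assumed name pattern.
QUERY = 'sum(rate(node_network_receive_bytes_total{instance=~"an-presto.*"}[5m])) * 8'

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    bits_per_sec = float(result[0]["value"][1])
    print(f"presto workers aggregate receive rate: {bits_per_sec / 1e9:.1f} Gbit/s")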

We could probably track down who was logged into Superset at the time and what they were running but it's probably not worth trying to dig too deeply, in retrospect.
Unfortunately, we don't have a presto query logger yet (See: T269832: Add a presto query logger) which would allow us to have more precise details about the data transfers caused by particular queries.

However, if this hypothesis is right, then it's useful to know that user-generated load on presto/HDFS can cause paging alerts. If we get similar alerts in future, then we will probably have to raise the priority of the QoS work, or find other ways to apply limits.

I wonder if there would be a way to apply different QoS policies for 1) the more steady-state production pipelines that are under our control and 2) the more spiky user-generated load. Just a thought.
I'm also looking forward to that switch upgrade in eqiad rows A-D.

I think that the most likely candidate at the moment is user-generated load from someone doing a heavy query in Superset, thereby pulling lots of data from HDFS, via the presto workers.

Thanks for the added info Ben.

In general this kind of thing worries me. We have multiple terabytes of data moving in a 15-minute period here, and hundreds of gigabits/sec peak transfer rate. That probably makes it the largest consumer of network resources we have.
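(As a rough upper-bound check using the figures above: 200 Gbit/sec sustained for 15 minutes would be 200 / 8 × 900 ≈ 22,500 GB, i.e. about 22.5 TB, so multiple terabytes over that window is consistent with the stacked graph.)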

If that's a scheduled, required job then there is no problem: we need to build and scale the network to accommodate our applications. But if it's just random user interaction, where a person working in Superset might trigger it inadvertently when interacting with a dashboard, it's a worry.

Unfortunately, we don't have a presto query logger yet (See: T269832: Add a presto query logger) which would allow us to have more precise details about the data transfers caused by particular queries.

Any additional insight would be good. I wonder - if we identified the kind of queries that caused this - could any constraints be put in place on the Superset side to prevent it happening normally?

If we get similar alerts in future, then we will probably have to raise the priority of the QoS work, or find other ways to apply limits.

QoS may allow us to tweak the exact parameters of the alerts, but I'm not sure we can get rid of them completely. The aim of QoS is to ensure more important traffic is unaffected during occasional out-of-profile spikes in usage. But we don't want a situation where link saturation happens regularly (that can lead to bufferbloat and increased latencies regardless of QoS). Nor can we assume an alert for link saturation is due to Hadoop or any previously-identified legitimate traffic.

If we are saturating links, the real answer is to add capacity. In that respect the first two rows in Eqiad will be done this calendar year, with the remaining two in 2025.

I wonder if there would be a way to apply different QoS policies for 1) the more steady-state production pipelines that are under our control and 2) the more spiky user-generated load. Just a thought.

We can certainly look at it. That said it's generally ok to place all bulk-data flows in the low priority class. Under normal circumstances that won't affect performance at all, and indeed allows it to get as much bandwidth as is available without static limits. If we feel we need to differentiate between different sub-types of usage we can of course take a look. Let's review again when we are further along with QoS.