
an-worker1165.eqiad.wmnet and increased network activity resulting in page on May 13 2024
Closed, Resolved · Public

Assigned To
Authored By
ssingh
Tue, May 14, 5:47 PM

Description

Yesterday, we got a page for (times in UTC):

18:27:46 <+jinxer-wm> FIRING: Primary outbound port utilisation over 80%  #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - 
                      Primary outbound port utilisation over 80%  #page   - 
                      https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page

After some inspection, and thanks to a recommendation by @CDanis, the suggestion was that this was internal traffic related to the Analytics cluster. network_flows_internal takes a while to catch up on which host it was, but it seems like it was an-worker1165.eqiad.wmnet and 10.64.157.4. You can see it in Turnilo at https://w.wiki/A4ym.

@Dzahn noticed that there were kafka-jumbo restarts around the same time, and the alert pointed to port 50010, which shows up in Puppet as the Hadoop DataNode port.

The alert self-resolved, which makes sense as the traffic started going down, but I am filing this task to see if we can make sense of what happened here so that we can prevent it from happening again, if required and if that makes sense.

Thanks!

Event Timeline

To add some context:

The ports that saturated weren't ports for individual machines on the access switches. Those don't page.

They were a few of the connections between the access switches and the core routers.

I think that this level of utilization on such a shared link can potentially cause user-facing issues with live traffic as well.

Thanks for the task and analysis.

it seems like it was an-worker1165.eqiad.wmnet and 10.64.157.4

Certainly that host had a very large amount of traffic sent to it, maxing its 10Gb/sec port with over 800kpps:

image.png (708×1 px, 204 KB)
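(As a rough cross-check, assuming the figures above: 10 Gbit/sec at 800 kpps is about 10e9 / 8 / 800,000 ≈ 1,560 bytes per packet, i.e. close to full-size 1500-byte frames, which is what you would expect for bulk data transfer.)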

In the same timeframe we see a high amount of traffic sent by an-worker1164, another host in rack E7 (and seen in Turnilo here)

image.png (631×1 px, 111 KB)

Those specific hosts, while standing out in Turnilo, may not have been the main cause of the alerts relating to the CR uplinks from rows A and C. But it does seem to have been an-worker/an-presto nodes in general.

@Dzahn noticed that there were kafka-jumbo restarts around the same time, and the alert pointed to port 50010, which shows up in Puppet as the Hadoop DataNode port.

Turnilo suggests that was indeed the port at play. Analysing the top talkers to/from that port over the time period is quite interesting (a rough sketch of that kind of aggregation follows the second table below). The busiest sources and destinations are all in the new rows; for instance, the top response flows from source port 50010 over that time:

Sender         | Sender Rack | Destination    | Destination Rack | GBytes Total
an-worker1164  | E7          | an-worker1165  | E7               | 579
an-worker1163  | E7          | an-worker1165  | E7               | 406
an-worker1162  | E6          | an-worker1160  | E6               | 358
an-worker1161  | E6          | an-worker1160  | E6               | 357
an-worker1172  | F7          | an-worker1174  | F7               | 283
an-worker1173  | F7          | an-worker1174  | F7               | 173
an-worker1169  | F6          | an-worker1170  | F6               | 160
an-worker1142  | E1          | an-worker1147  | E1               | 132
an-worker1153  | E1          | an-worker1147  | E1               | 119
an-worker1169  | F6          | an-worker1171  | F6               | 92

No hosts in the top 100 responders (i.e. traffic from port 50010) are in rows A and C, but I think that is probably just due to the volume being sent within the new rows topping the chart. Looking at requests (traffic to port 50010), we do see some hosts in those rows appearing, although the volume is an order of magnitude lower than within the new rows. Still, it is likely that the responses to the requests below account for the increased usage on the links in the alerts:

Sender         | Sender Rack | Destination    | Destination Rack | GBytes Total
an-worker1143  | E2          | an-worker1082  | A7               | 3
an-worker1167  | F5          | an-worker1089  | C4               | 2
an-worker1144  | F1          | an-worker1120  | A4               | 2
an-worker1161  | E6          | an-worker1119  | A2               | 2
an-worker1153  | E1          | an-worker1079  | A2               | 2
an-worker1174  | F7          | an-worker1141  | A4               | 2
an-worker1143  | E2          | an-worker1102  | A4               | 2
an-worker1167  | F5          | an-worker1078  | A2               | 2
an-worker1143  | E2          | an-worker1100  | C4               | 2
an-worker1169  | F6          | an-worker1133  | C7               | 2
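For reference, the kind of top-talker aggregation shown above can be sketched roughly as follows. This is an illustrative Python sketch only: the flow record fields (src, src_rack, dst, dst_rack, src_port, bytes) are hypothetical and do not reflect the actual netflow/Turnilo schema.

from collections import defaultdict

def top_talkers(flows, port=50010, limit=10):
    # Sum bytes per (sender, destination) pair for flows sourced from `port`.
    totals = defaultdict(int)
    for f in flows:
        if f["src_port"] == port:
            key = (f["src"], f["src_rack"], f["dst"], f["dst_rack"])
            totals[key] += f["bytes"]
    # Sort pairs by total volume, largest first.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]

# Example with a couple of made-up records (values in bytes):
flows = [
    {"src": "an-worker1164", "src_rack": "E7", "dst": "an-worker1165",
     "dst_rack": "E7", "src_port": 50010, "bytes": 579 * 2**30},
    {"src": "an-worker1163", "src_rack": "E7", "dst": "an-worker1165",
     "dst_rack": "E7", "src_port": 50010, "bytes": 406 * 2**30},
]
for (src, src_rack, dst, dst_rack), total in top_talkers(flows):
    print(f"{src} ({src_rack}) -> {dst} ({dst_rack}): {total / 2**30:.0f} GB")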

In terms of what this traffic is or why it kicked off I've no idea; I assume it is due to the kafka-jumbo restarts. In and of itself it's not that problematic, and we don't have other indications of user problems that it caused, but we can't say for sure that we didn't at least have some performance degradation.

I sound like a broken record on QoS, but I am back working on it, so we will hopefully have it in place on the network this quarter. I think this Hadoop / port 50010 traffic would be a good candidate to put into the "scavenger" traffic class, which means it can still go mad like this if needed, but other, more important user traffic will take precedence if there is congestion.
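For illustration, one common way bulk flows end up in a scavenger class is DSCP marking at the sender, with the network then mapping that code point to a low-priority queue; the classification could equally be done on the switches. A minimal Python sketch, assuming CS1 (DSCP 8) as the scavenger code point (this is not the actual QoS design):

import socket

# Assumption: CS1 (DSCP decimal 8) is the scavenger code point.
# The DSCP value occupies the upper six bits of the legacy TOS byte.
CS1_TOS = 8 << 2  # 0x20

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, CS1_TOS)
# Traffic sent on this socket now carries the CS1 marking, which the network
# could map to a low-priority queue, e.g.:
# sock.connect(("an-worker1165.eqiad.wmnet", 50010))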

The other thing that will help is when we begin to upgrade the switches in rows A-D in Eqiad. The new design provides more bandwidth between racks and also (when complete) means inter-rack traffic does not route through the core routers. Looking at the links in rows E/F here for instance there were no drops or even enough usage to make the links very busy.

From my point of view I don't think any other immediate actions are needed, and we can perhaps close this task. That said it might be good if the team(s) who manage these services can comment in terms of if this was expected or how often such events should be expected.

Quick note on what the traffic this task refers to is.
Port 50010 on an-worker nodes is the Hadoop HDFS DataNode data transfer port.
This is the port serving data requested from HDFS that is stored on that worker. Most of the traffic on this port is expected to come either from the same or other an-worker nodes, or from an-presto nodes (both are compute nodes reading HDFS data).
The amount of data read varies a lot depending on jobs. One can see that the amount of data read from HDFS is very spiky: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop&from=now-7d&to=now&viewPanel=111
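As a rough illustration of where that port comes from, the DataNode data transfer address is the dfs.datanode.address property in hdfs-site.xml (50010 was the legacy default; newer Hadoop releases default to 9866). A small Python sketch, assuming a config file at /etc/hadoop/conf/hdfs-site.xml:

import xml.etree.ElementTree as ET

def hdfs_property(path, name):
    # Return the value of a named property from a Hadoop XML config file.
    root = ET.parse(path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None  # not set explicitly, so the compiled-in default applies

print(hdfs_property("/etc/hadoop/conf/hdfs-site.xml", "dfs.datanode.address"))
# e.g. "0.0.0.0:50010" on clusters keeping the legacy port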

I don't know why that specific high-usage moment led to a page while some others have not, though.
Anyhow, +1 on setting up QoS for lower-priority traffic (Hadoop/presto for instance :).

The amount of data read varies a lot depending on jobs. One can see that the amount of data read from HDFS is very spiky: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop&from=now-7d&to=now&viewPanel=111

Thanks, the big surge in that graph matches what we observed on the network.

I don't know why that specific high-usage moment led to a page while some others have not, though.

It's all related to the exact level of usage, the specific hosts involved, and exactly where they are located. If any of our core links reaches 80% utilization for more than a few minutes, it will trigger. So one assumes that the other times we saw high utilization, it didn't quite trip the threshold.
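Purely as an illustration of that trigger condition (not the actual alerting rule; the window length and evaluation here are assumptions), a Python sketch:

def should_page(samples_pct, threshold=80.0, sustained_samples=5):
    # samples_pct: per-minute link utilisation percentages, most recent last.
    # Fire only if the last `sustained_samples` samples all exceed the threshold.
    recent = samples_pct[-sustained_samples:]
    return len(recent) == sustained_samples and all(s > threshold for s in recent)

print(should_page([60, 72, 85, 88, 91, 90, 86]))  # True: last five samples > 80%
print(should_page([60, 85, 88, 79, 91, 90, 86]))  # False: a dip below 80% in the window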

Anyhow, +1 on setting up QoS for lower-priority traffic (Hadoop/presto for instance :).

Great, I will touch base when we have the core set up to do this, it should be simple to enable via puppet.

Gonna close this one. As a last datapoint, if you 'stack' the Hadoop graph in Grafana you can clearly see that the cumulative reads at ~15:55 on May 14th were a good deal higher than any of the other spikes of usage over the past few days (peaking at almost 200Gbit/sec). So it makes sense that this one paged and the others didn't.

image.png (497×1 px, 132 KB)

I think that the most likely candidate at the moment is user-generated load from someone doing a heavy query in Superset, thereby pulling lots of data from HDFS, via the presto workers.

Here is a stacked graph of the presto workers' network throughput at around the same time; we can see a pretty sustained transfer to all fifteen presto workers, all pulling data from Hadoop workers as fast as possible.

image.png (961×1 px, 194 KB)
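For anyone reproducing that kind of aggregate view outside Grafana, a rough sketch against the Prometheus HTTP API is below. The endpoint URL and the instance label matcher are assumptions and would need adjusting to the local setup (e.g. the Thanos query endpoint and the actual host naming):

import requests

PROM_URL = "http://prometheus.example.org/api/v1/query"  # placeholder endpoint
# Aggregate receive throughput (bits/sec) across hosts matching the assumed name pattern.
QUERY = 'sum(rate(node_network_receive_bytes_total{instance=~"an-presto.*"}[5m])) * 8'

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    bits_per_sec = float(result[0]["value"][1])
    print(f"presto workers aggregate receive rate: {bits_per_sec / 1e9:.1f} Gbit/s")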

We could probably track down who was logged into Superset at the time and what they were running but it's probably not worth trying to dig too deeply, in retrospect.
Unfortunately, we don't have a presto query logger yet (See: T269832: Add a presto query logger) which would allow us to have more precise details about the data transfers caused by particular queries.

However, if this hypothesis is right, then it's useful to know that user-generated load on presto/HDFS can cause paging alerts. If we get similar alerts in future, then we will probably have to raise the priority of the QoS work, or find other ways to apply limits.

I wonder if there would be a way to apply different QoS policies for 1) the more steady-state production pipelines that are under our control and 2) the more spiky user-generated load. Just a thought.
I'm also looking forward to that switch upgrade in eqiad rows A-D.

I think that the most likely candidate at the moment is user-generated load from someone doing a heavy query in Superset, thereby pulling lots of data from HDFS, via the presto workers.

Thanks for the added info Ben.

In general this kind of thing worries me. We have multiple terabytes of data moving in a 15-minute period here, and hundreds of gigabits/sec peak transfer rate. That probably makes it the largest consumer of network resources we have.
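(As a rough upper-bound check using the figures above: 200 Gbit/sec sustained for 15 minutes would be 200 / 8 × 900 ≈ 22,500 GB, i.e. about 22.5 TB, so multiple terabytes over that window is consistent with the stacked graph.)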

If that's a scheduled, required job then there is no problem: we need to build and scale the network to accommodate our applications. But if it's just random user interaction, where a person working in Superset might trigger it inadvertently when interacting with a dashboard, it's a worry.

Unfortunately, we don't have a presto query logger yet (See: T269832: Add a presto query logger) which would allow us to have more precise details about the data transfers caused by particular queries.

Any additional insight would be good. I wonder - if we identified the kind of queries that caused this - could any constraints be put in place on the Superset side to prevent it happening normally?

If we get similar alerts in future, then we will probably have to raise the priority of the QoS work, or find other ways to apply limits.

QoS may allow us to tweak the exact parameters of the alerts, but I'm not sure we can get rid of them completely. The aim of QoS is to ensure more important traffic is unaffected during occasional out-of-profile spikes in usage. But we don't want a situation where link saturation happens regularly (that can lead to bufferbloat and increased latencies regardless of QoS). Nor can we assume an alert for link saturation is due to Hadoop or any previously-identified legitimate traffic.

If we are saturating links, the real answer is to add capacity. In that respect the first two rows in Eqiad will be done this calendar year, with the remaining two in 2025.

I wonder if there would be a way to apply different QoS policies for 1) the more steady-state production pipelines that are under our control and 2) the more spiky user-generated load. Just a thought.

We can certainly look at it. That said it's generally ok to place all bulk-data flows in the low priority class. Under normal circumstances that won't affect performance at all, and indeed allows it to get as much bandwidth as is available without static limits. If we feel we need to differentiate between different sub-types of usage we can of course take a look. Let's review again when we are further along with QoS.