
Packetloss issues on oxygen (and analytics1003)
Closed, Resolved · Public

Description

It seems oxygen and analytics1003 are having packet-loss issues.
Currently >10%.

From http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140708.txt:

[20:50:35] <icinga-wm> PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: packet_loss_average CRITICAL: 15.3968160504
[...]
[20:51:55] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 12.8006601681
[...]
[21:05:53] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 2.44271184874
[...]
[21:10:33] <icinga-wm> RECOVERY - Packetloss_Average on analytics1003 is OK: packet_loss_average OKAY: 1.16412831933
[...]
[21:35:57] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 10.1998965254
[...]
[21:40:33] <icinga-wm> PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: packet_loss_average CRITICAL: 12.2718278814
[...]


Version: unspecified
Severity: normal
Whiteboard: u=Kevin c=General/Unknown p=0 s=2014-07-24

Details

Reference
bz67694

Event Timeline

bzimport raised the priority of this task to Needs Triage. · Nov 22 2014, 3:27 AM
bzimport set Reference to bz67694.

I'll write down anything I find here. It looks like packets into analytics1003 increased by an unusual margin starting around 20:30 and continuing through 22:00 (the time of this comment):

http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=analytics1003.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1404856260&v=859712030&m=UDP_InDatagrams&vl=packets&ti=UDP%20Packets%20Received&z=large

Notice the large spike over 50K, unique of its kind in the weekly graph. And, seemingly in response, packets started being dropped at roughly the same time, 20:30 (search for packet_loss on the page):

http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=analytics1003.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2

(and notice the timing in one specific example):

http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=analytics1003.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1404856260&v=6.22941&m=packet_loss_90th_eqiad_text_cp&vl=%25&z=large

According to Puppet, analytics1003 is set up with role::analytics::kafkatee::webrequest::mobile, but I'm not too familiar with the infrastructure.

My theory so far is that the FIFA World Cup semifinals are increasing traffic on our servers and that the logging infrastructure can't keep up. I am not currently watching the match, so please don't spoil it for me in any follow-up or I will be cross with you.

I will investigate more tomorrow, and watch the ops channel now for any new info.

We've had some limited reports that some deleted files are being purged, which could be caused by packet loss of the HTCP packets. I wonder if these issues are related. (bug 67559)

Other than looking at the actual pages requested, I couldn't find any direct evidence from the daily pageview (PV) stats that the requests-per-second (RPS) spike was caused by a surge of interest in the Brazil game (it's definitely possible that the spike got diluted over the course of a 24h period).

http://stats.grok.se/en/latest/Brazil_v_Germany_(2014_FIFA_World_Cup)
http://stats.grok.se/en/latest/Brazil_national_football_team
http://stats.grok.se/en/latest/Germany_national_football_team
http://stats.grok.se/en/latest/2014_FIFA_World_Cup
http://stats.grok.se/en/latest/FIFA_World_Cup

Over the course of the week, the 2014 FIFA World Cup was the most viewed article on enwiki:

http://tools.wmflabs.org/wikitrends/english-most-visited-this-week.html

Just writing some notes for myself on how to troubleshoot this. For now, I'm putting it on hold until I have two more things:

  1. Kafka broker data is available to query again (right now we have one week of rolling data on the brokers, but they're not writing to HDFS because the cluster is being reinstalled with CDH5).
  2. Data from tonight's game, to see how it compares.

So, Christian was nice enough to walk me through Puppet. I will put this all on a wiki when I'm done, but for now:

Oxygen and what it does are defined in site.pp (which is where you can start searching for most host names; beware that some are matched by regular expression, e.g. analytics10..). Note that comments are NOT trustworthy; the Puppet code is:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Fsite.pp#L2181

The roles it uses can be found in manifests/role/, in this case logging.pp:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Frole%2Flogging.pp#L272

Notice that this role refers to other roles in the same file; in general, track down all the roles and parent classes that your instance's definition mentions. Oxygen also uses misc::udp2log:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Fmisc%2Fudp2log.pp

We notice the filter template is referenced here:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Fmisc%2Fudp2log.pp#L116

And found here:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/templates%2Fudp2log%2Ffilters.oxygen.erb#L46

And that line says the udp2log output is being piped to gadolinium. So, in conclusion, if oxygen lost UDP packets, gadolinium would not see them, which would affect webstatscollector. One way to check this is to compare pageviews from the sampled log with the output of webstatscollector, something I will do soon.
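
For that comparison, here is a minimal sketch of the idea in Python. The file paths, the sampling factor, the URL column, and the pagecounts layout are assumptions for illustration, not taken from the actual tooling:

#!/usr/bin/env python
# Sketch only: estimate requests for one article from the sampled udp2log
# file and compare against webstatscollector output. Assumed (hypothetical)
# formats: the sampled log is whitespace-separated with the request URL in
# column URL_FIELD and is sampled 1:SAMPLING; the pagecounts file has lines
# of the form "project title count bytes".

SAMPLING = 1000   # assumed sampling factor of the sampled log
URL_FIELD = 8     # assumed 0-based column holding the request URL

def sampled_log_estimate(path, title):
    """Scale the number of sampled hits for `title` back up by SAMPLING."""
    hits = 0
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) > URL_FIELD and title in fields[URL_FIELD]:
                hits += 1
    return hits * SAMPLING

def webstatscollector_count(path, project, title):
    """Look up the count for (project, title) in a pagecounts file."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3 and parts[0] == project and parts[1] == title:
                return int(parts[2])
    return 0

if __name__ == "__main__":
    title = "Brazil_v_Germany_(2014_FIFA_World_Cup)"
    est = sampled_log_estimate("sampled-1000.log", title)                     # illustrative path
    wsc = webstatscollector_count("pagecounts-20140708-210000", "en", title)  # illustrative path
    print("sampled-log estimate: %d  webstatscollector: %d" % (est, wsc))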

In general, other theories can be informed by the Server Admin Log: https://wikitech.wikimedia.org/wiki/Server_Admin_Log which for this time period says that a couple of varnish hosts were restarted and an apache was restarted. It's possible that the UDP packet-loss script just got confused by those restarts: it saw sequence numbers it wasn't expecting and assumed loss. This can be tested by looking at the sequence numbers in the sampled logs.
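
A rough sketch of such a sequence-number check (this is not the actual packet-loss monitor; the column layout and sampling factor below are assumptions):

#!/usr/bin/env python
# Rough sketch of the check described above -- not the actual packet-loss
# monitor. Assumed layout: a whitespace-separated sampled log with the
# sending host in the first column and a per-host sequence number in the
# second, sampled 1:SAMPLING (so logged sequence numbers should be roughly
# SAMPLING apart).
import sys
from collections import defaultdict

SAMPLING = 1000   # assumed sampling factor of the log being examined
stats = defaultdict(lambda: {"first": None, "last": None, "lines": 0, "resets": 0})

for line in sys.stdin:
    fields = line.split()
    try:
        host, seq = fields[0], int(fields[1])
    except (IndexError, ValueError):
        continue                        # skip malformed lines
    s = stats[host]
    if s["first"] is None:
        s["first"] = seq
    elif seq < s["last"]:
        s["resets"] += 1                # counter went backwards: restart, not loss
    s["last"] = seq
    s["lines"] += 1

for host, s in sorted(stats.items()):
    if s["resets"]:
        # A restarted sender confuses a naive loss estimate -- exactly the
        # false-alarm scenario suspected above.
        print("%s: %d sequence reset(s) seen, skipping loss estimate" % (host, s["resets"]))
        continue
    expected = (s["last"] - s["first"]) // SAMPLING + 1
    loss = 100.0 * (1.0 - float(s["lines"]) / max(1, expected))
    print("%s: estimated loss %.1f%%" % (host, loss))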

TL;DR:

  • analytics1003 alarms were harmless.
  • oxygen alarms point to real packet loss, which affected all udp2log multicast consumers during two ~4-hour periods.
  • The issue is expected to recur during the rare traffic spikes we see.
  • The issue is /not/ expected to recur anytime soon for our usual day-to-day traffic.
  • Let's not invest in adjusting the udp2log setup and instead move to Kafka :-)

  • analytics1003 alarms:

I could not find a consumer of udp2log data on analytics1003 other than sqstat, and Puppet says [1] that udp2log is only there for sqstat.

Hence, I assume the alarms we saw on analytics1003 did not cause issues in real data pipelines.

  • oxygen alarms:

While we saw the alarms on oxygen, they were really caused by a bottleneck on gadolinium (see below), which caused message loss for all udp2log multicast consumers: for example webstatscollector (stats.grok.se, ...), the mobile-sampled-100 stream, and the zero stream [2].

The losses happened during two periods:

  • 2014-07-08 19:00 -- 2014-07-08 22:00
    • eqiad: ~3% loss
    • esams: ~7% loss
    • ulsfo: ~10% loss
  • 2014-07-13 19:00 -- 2014-07-13 23:00
    • eqiad: ~12% loss
    • esams: ~25% loss
    • ulsfo: ~25% loss

Those two periods map to events of the 2014 FIFA World Cup [3]. The pages requested and the referers during that time also support the conclusion that the increase in traffic is indeed soccer-related.


  • Bottleneck on gadolinium

For gadolinium, inbound and outbound traffic are typically close to each other in volume (the difference is usually less than 2 MB/s). But around both of the above periods:

  • inbound and outbound traffic grew together up to 70 MB/s; then
  • inbound traffic grew further (max ~95 MB/s on 2014-07-13), while
  • outbound traffic stayed close to 70 MB/s; after some time
  • inbound traffic came back down to 70 MB/s; and
  • inbound and outbound traffic together decreased again to their usual daily pattern.

So the bottleneck looks like either

  • a limit on outbound network bandwidth, or
  • a lack of resources to produce enough outbound packets.

Since the issue happened some days ago, it's hard to get more logs and rule either of the two out.

However, gadolinium's network card should be able to bring more data to the wire. Also, SNMP for gadolinium does not show send errors.
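
As a side note, the send-error counters can also be eyeballed locally; here is a small sketch reading /proc/net/dev rather than the SNMP counters referred to above:

#!/usr/bin/env python
# Quick local cross-check of the "no send errors" observation: read the TX
# error and drop counters straight from /proc/net/dev (Linux). This is not
# the SNMP query mentioned above, just a way to eyeball similar counters on
# the host itself.
with open("/proc/net/dev") as f:
    for line in f:
        if ":" not in line:
            continue                            # skip the two header lines
        iface, counters = line.split(":", 1)
        fields = counters.split()
        # Layout: 8 receive counters, then 8 transmit counters; within the
        # transmit half: bytes, packets, errs, drop, fifo, colls, carrier, ...
        tx_errs, tx_drop = int(fields[10]), int(fields[11])
        print("%s: tx_errs=%d tx_drop=%d" % (iface.strip(), tx_errs, tx_drop))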

On the other hand, it seems the socat process that is feeding the multicast (by far the biggest part of outbound traffic on gadolinium) is indeed a bottleneck. This process is continuously using between 70% and 95% of a core. This percentage changes over time and closely follows the amount of inbound network traffic. Extrapolating from this relation, we should expect issues somewhere above ~65 MB/s of inbound traffic.

This extrapolation matches the above periods: during those periods, inbound traffic jumped above 70 MB/s, while we're typically below 60 MB/s on normal days.
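
For what it's worth, the extrapolation is just a proportionality argument; a back-of-the-envelope sketch (the observed pair below is illustrative, not a measurement from gadolinium):

# Back-of-the-envelope version of the extrapolation above, assuming socat's
# CPU usage is roughly proportional to inbound traffic. The observed pair
# below is illustrative, not a measured value from gadolinium.
observed_inbound_mbps = 60.0    # MB/s of inbound udp2log traffic (illustrative)
observed_cpu_fraction = 0.92    # fraction of one core used by socat (illustrative)

cpu_per_mbps = observed_cpu_fraction / observed_inbound_mbps
saturation_mbps = 1.0 / cpu_per_mbps    # inbound rate at which socat pins a full core

print("socat expected to saturate one core around %.0f MB/s inbound" % saturation_mbps)
# => ~65 MB/s: below the spike-period rates (>70 MB/s), above normal days (<60 MB/s).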

If we do not take action, I'd expect

  • any reasonable traffic spike to cause a similar udp2log outage, and
  • normal traffic not to cause similar udp2log outages on a regular basis.

    We still have some small room for growth, and that room will keep us covered for usual day-to-day traffic for some time (at least a few months, if there is no considerable change in the way our traffic grows).

Of the many paths forward, only two seem viable to me. We could

  • ignore the udp2log issues, as they only hit us during rare spikes, and put more effort into moving to Kafka. The Kafka infrastructure protects us against this kind of failure.
  • ask ops to split gadolinium's incoming udp2log traffic into two parts, keep one of the two parts on gadolinium, and feed the other part to a separate socat process that produces to the same multicast address.

    This change would be transparent to multicast consumers and would distribute the socat load across two processes, hence removing the socat bottleneck (see the sketch after this list).

    Once the bottleneck on gadolinium is removed, consumers of this data pipeline should be able to handle the spikes: udp2log and its grep-based filters, as well as udpfilter and the webstatscollector filter, are all using <50% of a CPU, so the downstream consumers have enough resources.
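
To illustrate why the second option would be transparent to consumers, here is a minimal Python sketch of the relay role that socat plays (addresses and ports are hypothetical; production uses socat itself, not this script). Two copies of such a relay, each fed half of the traffic, publish to the same multicast group, and subscribers simply see one merged stream:

#!/usr/bin/env python
# Minimal sketch of the relay role socat plays here: read udp2log datagrams
# arriving on a unicast port and re-emit them to a multicast group. Running
# two copies of this (one per half of the split traffic) would feed the same
# group, so subscribers see a single merged stream. Addresses and ports are
# hypothetical, not the production values.
import socket

LISTEN_PORT = 8420                             # hypothetical inbound udp2log port
MCAST_GROUP, MCAST_PORT = "239.0.0.1", 8421    # hypothetical multicast group

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("0.0.0.0", LISTEN_PORT))

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 8)

while True:
    data, _ = rx.recvfrom(65535)
    # Consumers join the multicast group; they cannot tell (and do not care)
    # which relay instance forwarded a given datagram.
    tx.sendto(data, (MCAST_GROUP, MCAST_PORT))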

Given how seldom we see spikes, I'd vote for focusing on Kafka.

(I am leaving closing the bug to management, as they need to decide on
this)

[1] https://git.wikimedia.org/blob/operations%2Fpuppet.git/fce2b1c036d503723fbea865273f2d8a27004546/manifests%2Fsite.pp#L114

[2] Note that the sampled-1000 stream is /not/ affected, as it is generated by a separate udp2log pipeline.

[3] On 2014-07-08, a semi-final took place. On 2014-07-13, the final took place.

Marking issue as resolved.
I have prioritized focusing on Kafka next, so there is no more work needed on udp2log.
https://bugzilla.wikimedia.org/show_bug.cgi?id=68139