VE edit data stopped due to statsv falling over (?) on webperf1001
Open, Needs Triage, Public

Description

https://grafana.wikimedia.org/d/000000094/visualeditor-load-save?orgId=1&from=1574537980392&to=1574572593910

No deploys at that time (01:25 UTC on a Sunday). No SAL entries. Did statsd or whatever fall over or something?

Occurrences

  • 2019-11-24Z00:57
  • 2019-12-01Z22:45
  • 2019-12-04Z16:51

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Nov 25 2019, 4:18 PM
mforns added subscribers: marcella, mforns. (Edited) · Nov 25 2019, 5:02 PM

@Jdforrester-WMF Does this go straight to statsd/grafana?
We believe this doesn't go through the EventLogging pipeline.
Do you know where it comes from?
Maybe @marcella?

mforns moved this task from Incoming to Radar on the Analytics board. · Nov 25 2019, 5:04 PM
Nuria added a subscriber: Nuria. · Nov 25 2019, 5:05 PM

Seems that this might be using mw.track() from the client, which sends data straight to statsd?

Yes, this isn't an EventLogging stats system, it's straight into statsd (or statsv? I forget which is which).
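
For reference, a rough sketch of how this path is generally understood to work: the client sends a beacon request like /beacon/statsv?metric=value, and statsv translates each query parameter into a statsd plaintext line. The query format, the suffix handling, and the example metric name below are assumptions for illustration, not taken from this task.

```python
# Hedged sketch of the assumed /beacon/statsv -> statsd translation.
# The suffixes ('ms', 'c', 'g') and the example metric name are illustrative.
from urllib.parse import parse_qsl


def statsv_query_to_statsd_lines(query_string):
    """Turn a beacon query string into statsd plaintext lines.

    e.g. 've.desktop.performance.system.activation=1234ms'
      -> 've.desktop.performance.system.activation:1234|ms'
    """
    lines = []
    for metric, value in parse_qsl(query_string):
        if value.endswith('ms'):              # timing metric
            lines.append('%s:%s|ms' % (metric, value[:-2]))
        elif value.endswith('c'):             # counter
            lines.append('%s:%s|c' % (metric, value[:-1]))
        elif value.endswith('g'):             # gauge
            lines.append('%s:%s|g' % (metric, value[:-1]))
    return lines


print(statsv_query_to_statsd_lines('ve.desktop.performance.system.activation=1234ms'))
# ['ve.desktop.performance.system.activation:1234|ms']
```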

Yeah, there's always some serious noise in that data; maybe the triggering of the events stopped, and so the last, slow events came in "late"?

Krinkle added a subscriber: Krinkle. (Edited) · Nov 25 2019, 9:04 PM

This graph is based on a per-minute median, with a 30-minute moving average applied on top.

For real-user metrics it is generally expected that, when data is lost or traffic drains away because a feature is disabled or broken, the last few minutes skew heavily towards outliers. We've seen this consistently over the past five years whenever there is a service outage or an instrumentation/processing failure. This is because in Graphite/Statsd we aggregate each minute separately, which over-represents outliers when a given minute only contains a handful of data points, and we then treat each minute equally regardless of its population size (this is, in a nutshell, why we are moving away from this way of using Statsd, see blog post).
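
As a toy illustration of that effect (invented numbers, not data from this incident): when a draining final minute only contains a handful of slow events, its per-minute median can sit far above the population median, yet Graphite gives that minute the same weight as every fully populated minute.

```python
# Toy illustration (made-up numbers): why treating every minute equally
# over-represents outliers once a minute only has a handful of samples.
import random
import statistics

random.seed(42)

# 59 "healthy" minutes with ~1000 samples each, mostly fast activations...
healthy_minutes = [
    [random.gauss(900, 150) for _ in range(1000)] for _ in range(59)
]
# ...and one final minute where only 5 stray (slow) events trickled in.
draining_minute = [random.gauss(4000, 500) for _ in range(5)]

per_minute_medians = [statistics.median(m) for m in healthy_minutes + [draining_minute]]

# Graphite-style view: every per-minute median weighs the same,
# so the 5-sample minute distorts the tail of the series.
print('last per-minute median:', round(per_minute_medians[-1]))   # roughly 4000
# Population view: those 5 events are invisible among ~59,000 samples.
all_samples = [x for m in healthy_minutes for x in m] + draining_minute
print('overall median:', round(statistics.median(all_samples)))    # roughly 900
```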

The same graph without the moving average:

There is only one data point that is "off" (the last minute), and it only affected the p95, not even the median. This strongly suggests the last minute simply saw fewer values between the median and the p95, thus making the p95 over that small subset higher.

(Graph: sample rate)

The metric stopped at 00:57 exactly, with values up to that point all working fine and after that point completely missing. This suggests it is not related to client-side instrumentation in VisualEditor's own code, given that it is not possible to deploy that instantaneously worldwide and have everyone stop their existing editing sessions at once. As such, it has to be something server-side in our infra that caused this.

TL;DR:

  • This cannot be caused by a bug in the VisualEditor instrumentation, nor in MediaWiki core, mw.track(), WikimediaEvents/statsd.js etc.
  • This has to be caused by something server-side in Varnish, Kafka, Statsv.py, Statsd, or Graphite.
  • Yet, it appears these are at least overall working okay, insofar as various other Statsv metrics are still making their way through the pipeline unaffected.

It is possible that whatever is causing this is also affecting T234138.

Nuria added a subscriber: Ottomata. · Nov 26 2019, 1:34 AM

Super nice explanation @Krinkle, shiny start of the week, really. And note to self: I need to look at how these things are defined in graphite ("movingAverage(ve.$Target.performance.system.activation.{median,p95}, 30)"), and ya, averages of percentiles do not seem a great idea. A time series of percentiles seems like it would suffice in this case.
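
A tiny made-up example of the "averages of percentiles" concern (numbers invented purely for illustration, not from these graphs):

```python
# Made-up numbers: averaging per-minute p95s is not the p95 of the data.
def p95(samples):
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

minute_a = [100] * 90 + [2000] * 10   # one minute with a slow tail, p95 = 2000
minute_b = [100] * 100                # a quiet minute, p95 = 100

print((p95(minute_a) + p95(minute_b)) / 2)   # 1050.0 -- average of per-minute p95s
print(p95(minute_a + minute_b))              # 100    -- p95 of all samples together
```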

@Ottomata who in the larger SRE team is responsible for graphite/statsd?

Gotta say that series looks odd from 10/31 (and that might be completely unrelated to the problem at hand).

Of interest on 10/31 deploy:
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/VisualEditor/+/547667/
https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/546724/2/wmf-config/InitialiseSettings.php

The data started coming in again at 2019-11-26Z14:43…

CDanis added a subscriber: fgiunchedi.
Krinkle added a comment. (Edited) · Nov 26 2019, 7:55 PM

Well, it certainly doesn't make sense that PHP memory_limit on wpt* servers could influence whether Varnish/Kafka/Statsv process web requests for /beacon, but at the same time, nothing else happened around that time. Even if this were somehow indirectly related one would expect it to trickle in over the course of 3-5 minutes, and not instantly switch back on globally at full rate from browsers.

Either way, we should probably start looking at the stack from end to end to see where data was lost, e.g. starting from the varnishkafka/webrequests_raw side to see whether those /beacon/statsv requests kept coming in over the past few days at their normal rate or not, and if not, where they got lost (in the varnishkafka statsv filter, in statsv.py, etc.).
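
One way to check the Kafka leg of that path, sketched under assumptions: that the beacons land on a Kafka topic named 'statsv', that kafka-python is available, and that the broker hostname below (a placeholder) is reachable. Counting messages per minute should make any gap obvious.

```python
# Sketch only: count statsv messages per minute to find where the data stopped.
# Assumptions: topic name 'statsv', kafka-python installed, placeholder broker host.
from collections import Counter
from datetime import datetime

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'statsv',                               # assumed topic name
    bootstrap_servers='BROKER_HOST:9092',   # placeholder, replace with a real broker
    auto_offset_reset='earliest',
    consumer_timeout_ms=10000,              # stop iterating once caught up
)

per_minute = Counter()
for message in consumer:
    # Kafka message timestamps are in milliseconds since the epoch.
    minute = datetime.utcfromtimestamp(message.timestamp / 1000).strftime('%Y-%m-%dT%H:%M')
    per_minute[minute] += 1

for minute, count in sorted(per_minute.items()):
    print(minute, count)
```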

Yeah, I was thinking about maybe a key server getting stuck somehow, and getting unlocked by the next deploy, but it seems very unlikely.

Yeah, maybe bandwidth congestion of some kind selectively causing these requests to get lost at the edge, or the webperf1001/ganeti<->statsd-lb networking getting saturated somehow further along.

Nuria added a comment. · Nov 26 2019, 9:15 PM

@Jdforrester-WMF Is this data published from parsoid servers or browser clients?

> @Jdforrester-WMF Is this data published from parsoid servers or browser clients?

Client-side.

Moving to our Radar as the MW parts that we're familiar with seem to be working fine. The next suspect seems to be the Varnish/Kafka/statsv.py layer, which Analytics are more familiar with. Note that this might be lower priority now as it seems to have recovered, but it is not the first time this has happened, so getting to the bottom of it, or at least knowing more about it, would be useful. Especially given that the data won't be available after 90 days, now is a good time to capture and analyse what we have to maybe narrow down the root cause.

Jdforrester-WMF renamed this task from VE edit data stopped at ~2019-11-24Z01:25 to VE edit data stopped at 2019-11-24Z00:57 and again at 2019-12-01Z22:45. · Dec 2 2019, 7:35 PM

It's re-broken itself.

Nuria added a comment. · Dec 2 2019, 7:50 PM

Pinging @Ottomata in case he has any ideas.

In statsv service logs on webperf1001:

Dec 01 22:45:46 webperf1001 python[13187]: Process Process-2:
Dec 01 22:45:46 webperf1001 python[13187]: Traceback (most recent call last):
Dec 01 22:45:46 webperf1001 python[13187]:   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Dec 01 22:45:46 webperf1001 python[13187]:     self.run()
Dec 01 22:45:46 webperf1001 python[13187]:   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
Dec 01 22:45:46 webperf1001 python[13187]:     self._target(*self._args, **self._kwargs)
Dec 01 22:45:46 webperf1001 python[13187]:   File "/srv/deployment/statsv/statsv/statsv.py", line 191, in process_queue
Dec 01 22:45:46 webperf1001 python[13187]:     sock.sendto(statsd_message.encode('utf-8'), statsd_addr)
Dec 01 22:45:46 webperf1001 python[13187]: gaierror: [Errno -3] Temporary failure in name resolution

I don't know why it just stopped working at that point, but it looks like statsv failed to send stats to statsd. I just restarted statsv there, and I think it is working ok now.
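
For what it's worth, a simplified sketch (not the actual statsv.py code) of one way the send path could tolerate a transient "Temporary failure in name resolution" instead of letting the worker process die and silently stop forwarding metrics:

```python
# Simplified sketch, not the real statsv.py: retry on transient DNS failures
# instead of letting the exception kill the worker process.
import logging
import socket
import time


def send_to_statsd(sock, statsd_addr, statsd_message, retries=3):
    """Send one statsd line, retrying briefly if name resolution hiccups."""
    for attempt in range(retries):
        try:
            sock.sendto(statsd_message.encode('utf-8'), statsd_addr)
            return True
        except socket.gaierror as exc:
            # sendto() re-resolves a (hostname, port) address on each call,
            # so a short backoff gives DNS a chance to recover.
            logging.warning('statsd send failed (%s), attempt %d/%d',
                            exc, attempt + 1, retries)
            time.sleep(2 ** attempt)
    return False
```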

Also possibly relevant:

Dec 01 12:21:31 webperf1001 python[13187]: [357B blob data]
Dec 01 12:21:31 webperf1001 python[13187]: Traceback (most recent call last):
Dec 01 12:21:31 webperf1001 python[13187]:   File "/srv/deployment/statsv/statsv/statsv.py", line 175, in process_queue
Dec 01 12:21:31 webperf1001 python[13187]:     data = json.loads(raw_data)
Dec 01 12:21:31 webperf1001 python[13187]:   File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
Dec 01 12:21:31 webperf1001 python[13187]:     return _default_decoder.decode(s)
Dec 01 12:21:31 webperf1001 python[13187]:   File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
Dec 01 12:21:31 webperf1001 python[13187]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Dec 01 12:21:31 webperf1001 python[13187]:   File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
Dec 01 12:21:31 webperf1001 python[13187]:     obj, end = self.scan_once(s, idx)
Dec 01 12:21:31 webperf1001 python[13187]: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 39: invalid start byte
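
For that second traceback, a sketch of defensive handling (again, not the real statsv.py code; the function name is made up): skip a payload that doesn't decode cleanly rather than let it raise out of the consume loop.

```python
# Sketch only: drop an undecodable statsv message instead of raising.
import json
import logging


def parse_statsv_message(raw_data):
    """Decode bytes leniently and parse JSON; return None for bad payloads."""
    try:
        return json.loads(raw_data.decode('utf-8', errors='replace'))
    except (ValueError, UnicodeDecodeError) as exc:
        logging.warning('Dropping undecodable statsv message: %s', exc)
        return None
```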

Nuria added a comment. · Dec 2 2019, 8:07 PM

Seems like an error like that one would affect all metrics sent to statsv from the frontend.

Nuria added a comment. · Dec 2 2019, 8:12 PM

But other graphs seem unaffected (just looked at navigation timing ones). It is surely not caused by new changes, as statsv.py hasn't seen any for a while: https://github.com/wikimedia/analytics-statsv/commits/master/statsv.py

And again. @Ottomata, can you kick webperf1001 again?

Jdforrester-WMF renamed this task from VE edit data stopped at 2019-11-24Z00:57 and again at 2019-12-01Z22:45 to VE edit data stopped due to statsv falling over (?) on webperf1001. · Dec 5 2019, 5:58 PM
Jdforrester-WMF updated the task description.
Restricted Application added a subscriber: Gilles. · View Herald Transcript · Dec 5 2019, 5:58 PM

Done. BTW, the same errors were present in recent logs.

fgiunchedi moved this task from Inbox to Radar on the observability board. · Dec 9 2019, 11:08 AM