Page MenuHomePhabricator

Investigate 60% drop in Navigation Timing report rate
Closed, ResolvedPublic

Description

It seems about 24 hours ago, with the deployment of 1.31.0-wmf.5 to (most) wikis, the report rate of Navigation Timing dropped very sharply by about 60% or more.

The group2 (all Wikipedias) deployment was originally scheduled for Thursday but delayed to Monday (T174361).

Overall
Grafana link
By browser
Grafana link

By platform

MobileDesktop
Grafana linkGrafana link

Strangely enough, despite it affecting all browsers, user types, platforms, and countries equally, the raw EventLogging report rate has not significantly decreased:

Event Timeline

Krinkle created this task.Nov 2 2017, 2:18 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 2 2017, 2:18 AM
Krinkle added a subscriber: Peter.EditedNov 2 2017, 7:45 AM

Given the Kafka and EventLogging streams haven't changed significantly in volume, this means the volume produced from clients hasn't changed.

Server Admin Log

This mostly rules out any infrastructure problem with ResourceLoader, EventLogging, Varnish, or Kafka.

Facts

  • The clients can't be broken as they still report stuff in the same volume as before.
  • The drop in metrics in Grafana doesn't seem to have impacted the medians in any significant way. And the drop in rate pretty much affects all browsers, users, platforms and countries equally.
  • The navtiming daemon hasn't been changed. And even if it would have changed, it has unit tests that would prevent something like events being ignored.

I've gone over the navtiming daemon and also did a manual opening of the kafka stream for the eventlogging_NavigationTiming topic to see what the data looks like. It looks fine as far as I can tell, and not very different from the example fixture we have in our unit tests.

sslNegotiation

After hooking up a hacked version of navtiming.py that subscribes to the steam and logs the would-be statsd writes to the terminal output, it seems to be fine. However, after ruling out everything else, I took another look at what it was producing and noticed a disproportionate number of sslNegotiation being sent to statsd.

Indeed, looking at sslNegotiation metric on the Navigation Timing dashboards (instead of mediaWikiLoadComplete or loadEventEnd) around the same time where most navtiming metrics got dropped, sslNegotiation actually grew!

Overall: sslNegotiationOverall: mediaWikiLoadComplete
Grafana linkGrafana link
+65/min -70/min

Now it all makes sense.

Two months ago, when working on the new navtiming2 metrics (T104902) @Peter and I noticed how the old metrics have a bug where any packet with secureConnectionStart has only that metric reported to Graphite and the rest of the packet dropped. This is due to some old code from 2015 (f2562a5a9e2) from a time where HTTPS was opt-in and we were interested in measuring its impact (if any) on performance.

After HTTPS became the default, we forgot to remove this. It must've significantly affected how much data we were receiving at the time, but due to the fact that SSL sessions are shared between page views, and that the client-side NavigationTiming code is excluding zeroes, this meant that any repeat view, essentially got interpreted the same way our pre-2015 non-HTTPS traffic would have reported an event. As such, we were still getting plenty of data and we didn't notice anything.

While this bug is fixed in the new navtiming2 code, that namespace is still being developed. Our main dashboards use the navtiming namespace in Graphite, and as part of fixing an unrelated bug in navtiming2 (T178479), we made a change in the client to no longer exclude zeros from the client-side (c40a22675d). This obviously meant that we were now always reporting secureConnectStart, and as such making the old code even more broken because where previously it was only showing metrics for repeat views (where secureConnectStart was 0 and thus excluded), it is now always reported if the property exists in the performance.timing object.

Unequal browsers: NavTiming properties

So then how come we are still getting metrics? How come it appears to be affecting browsers equally?

Well, looking at the browser breakdown shows that it actually isn't affecting browsers equally for true Navigation Timing metrics. The mediaWikiLoadComplete metric seems to be affected equally in all browsers (about 50% drop from all major browsers). But looking at any other metric (firstPaint, domInteractive, loadEventEnd, etc.) shows the data has completely disappeared except for MSIE, Edge and Firefox:

By browser: firstPaintdomInterativeloadEventEnd
By browser: loadEventEnd
MSIE/Edge/FirefoxOthers

So it's clear that the reporting of MSIE/Edge/Firefox has not changed. Looking at the raw Kafka/EventLogging stream, I find that indeed packets from these browsers indeed rarely include secureConnectStart (only a few from Firefox, and only Firefox 56).

Turns out, secureConnectStart was optional in the initial version of the W3C Navigation Timing spec and became mandatory in Navigation Timing 2 - which MSIE/Edge haven't implemented yet. And Firefox 56 only just shipped last month adding support for it. As such, most MSIE/Edge/Firefox traffic is simply never providing this attribute (zero or not).

Firefox 56 adds secureConnectionStart support

Okay. That explains why the NavTiming properties dropped the way they did.

Equal browsers: MW properties

How come mediaWikiLoadComplete is still seemingly being reported by all major browsers, including the ones that (as far as we know) support secureConnectStart and we send a zero from those clients where possible?

Well, it turns out there's quite a number of browsers that are Safari/WebKit/Chrome look-a-likes falling into the catch-all "Mobile Safari" or "Chrome" category. There are also a number of clients with outdated versions of these browsers that don't support the Navigation Timing API, or that produce non-compliant data that results in NavTiming props being ignored by the client but MW props still reported. This is not new behaviour, it just means that the remaining traffic is that from browsers where NavTiming isn't supported or compliant.

Change 387997 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[operations/puppet@production] Removed old secureConnectionStart check that throw away metrics

https://gerrit.wikimedia.org/r/387997

Peter awarded a token.Nov 2 2017, 1:03 PM
Peter added a comment.Nov 2 2017, 1:09 PM

Wow @Krinkle! I'm super impressed with how you investigated this issue and how you so thoroughly described it. I think we can take your description and make it into a blog post on our blog.

Change 387997 merged by Giuseppe Lavagetto:
[operations/puppet@production] navtiming: Removed old secureConnectionStart check that throw away metrics

https://gerrit.wikimedia.org/r/387997

Wow @Krinkle! I'm super impressed with how you investigated this issue and how you so thoroughly described it. I think we can take your description and make it into a blog post on our blog.

Thanks. I'm working on T153048 now, maybe I'll do this one next :)

Krinkle claimed this task.Nov 2 2017, 10:06 PM
Krinkle triaged this task as High priority.
Krinkle moved this task from Inbox to Doing on the Performance-Team board.Nov 6 2017, 8:39 PM
Krinkle renamed this task from Investigate 60% loss of Navigation Timing reporting to Investigate 60% drop in Navigation Timing report rate.Nov 6 2017, 8:57 PM

Next: Improve our monitoring/alerts so that we get notified if the rate drops this far again in the future.

I've added the alerts for report rates to https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts

Alerts will fire based on avg() rate of last hour, evaluated every 30min.


Krinkle closed this task as Resolved.Jan 9 2018, 7:01 PM