Page MenuHomePhabricator

Navigation Timing is broken
Closed, ResolvedPublic

Description

We got an alert for the overall increase in diff for firstPaint:

Screen Shot 2017-09-18 at 7.36.03 AM.png (1×2 px, 249 KB)

https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts?refresh=5m&panelId=12&fullscreen&orgId=1

It seems like it started 14/9 16.xx or 15/9 06.xx (larger increase).

Checking the server log, my change for navigation timing went live 14/9

  • 16:59 krinkle@tin: Synchronized php-1.30.0-wmf.17/extensions/NavigationTiming: T104902 (duration: 01m 00s)

It seems like median/p75/p95 are not changed on the fly,
see that they keep the same value for hours (https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts?refresh=5m&orgId=1&panelId=23&fullscreen&from=1505436334194&to=1505598247554):

Screen Shot 2017-09-18 at 7.41.57 AM.png (1×2 px, 209 KB)

Event Timeline

It seems like median/p75/p95 are not changed on the fly,
see that they keep the same value for hours (https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts?refresh=5m&orgId=1&panelId=23&fullscreen&from=1505436334194&to=1505598247554):

The metrics on that graph are 24h moving averages, it would makes sense that they don't move instantly.

However, something is very wrong with that graph (at least over that period), as median p75 and p95 are identical, with or without the moving average. The explanation is seen when zooming out, it seems like the data mostly stopped and the brokenness is just some artifact:

Capture d'écran 2017-09-18 09.18.21.png (464×1 px, 131 KB)

And the report rate confirms it, it tanked:

Capture d'écran 2017-09-18 09.19.53.png (546×1 px, 82 KB)

The timing corresponds with @Krinkle's deployment.

Gilles renamed this task from Investigate overall increase in diff for firstPaint 2017-09-15 19:00:02 to Navigation Timing is broken.Sep 18 2017, 7:22 AM
Gilles triaged this task as Unbreak Now! priority.

Maybe the ingestion point looks at the old EL schema? I'm going to check EL SQL data to see if it's working there or not.

SELECT COUNT(*) FROM `NavigationTiming_16305090` WHERE timestamp >= 20170912000000 AND timestamp < 20170913000000

361199

SELECT COUNT(*) FROM `NavigationTiming_17216284` WHERE timestamp >= 20170915000000 AND timestamp < 20170916000000

288187

Looks like the same order of magnitude, difference is probably due to weekday/weekend. This would suggest that Grafana/graphite is just looking at the old schema, which is still getting a tiny trickle of clients with outdated cached client side code referencing the old schema version.

The ideas was that we didn't need to change coal at at the same moment (pickup that new metrics) but we missed something, I'm not familiar how it exactly works.

Coal is fine, it's the EL/kafka intake that seems to be stuck on the old schema.

Coal is fine, it's the EL/kafka intake that seems to be stuck on the old schema.

This is odd. We very intentionally only filter by schema name in our pipeline, not by revision, so that we always consider older versions without needing to fragment the handling at each layer. In addition, we aim that each version is backwards and forwards compatible (we only add optional fields and never tighten restrictions on existing fields).

Reporting of navtiming.fetchStart in Graphite went from 70/min to ~ 1/min:

Screen Shot 2017-09-18 at 16.45.18.png (934×2 px, 220 KB)

https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&panelId=12&fullscreen&orgId=1&var-metric=fetchStart

Reporting of navtiming.mediaWikiLoadComplete went from 200/min to 30/min:
https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&panelId=12&fullscreen&orgId=1&var-metric=mediaWikiLoadComplete

Statistics from Kafka production directly, also impacted, albeit not nearly as much. Went from around 300/min to 250/min.

Screen Shot 2017-09-18 at 16.46.25.png (818×2 px, 423 KB)

https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=NavigationTiming&from=now-7d&to=now

https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&from=now-7d&to=now&var-server=hafnium&var-network=eth0

Screen Shot 2017-09-18 at 16.47.09.png (565×2 px, 177 KB)

Change 378737 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@master] Unbreak navTiming reporting

https://gerrit.wikimedia.org/r/378737

Change 378740 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@wmf/1.30.0-wmf.18] Unbreak navTiming reporting

https://gerrit.wikimedia.org/r/378740

Change 378737 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Unbreak navTiming reporting

https://gerrit.wikimedia.org/r/378737

Change 378740 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@wmf/1.30.0-wmf.18] Unbreak navTiming reporting

https://gerrit.wikimedia.org/r/378740

Mentioned in SAL (#wikimedia-operations) [2017-09-18T21:18:47Z] <krinkle@tin> Synchronized php-1.30.0-wmf.18/extensions/NavigationTiming/modules/ext.navigationTiming.js: Unbreak - T176105 (duration: 00m 48s)

Krinkle removed a project: Patch-For-Review.
Krinkle added a project: NavigationTiming.