Page MenuHomePhabricator

Navigation Timing is broken
Closed, ResolvedPublic

Description

We got an alert for the overall increase in diff for firstPaint:


https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts?refresh=5m&panelId=12&fullscreen&orgId=1

It seems like it started 14/9 16.xx or 15/9 06.xx (larger increase).

Checking the server log, my change for navigation timing went live 14/9

  • 16:59 krinkle@tin: Synchronized php-1.30.0-wmf.17/extensions/NavigationTiming: T104902 (duration: 01m 00s)

It seems like median/p75/p95 are not changed on the fly,
see that they keep the same value for hours (https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts?refresh=5m&orgId=1&panelId=23&fullscreen&from=1505436334194&to=1505598247554):

Event Timeline

Peter created this task.Sep 18 2017, 5:44 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 18 2017, 5:44 AM
Peter updated the task description. (Show Details)Sep 18 2017, 5:46 AM
Gilles added a subscriber: Gilles.Sep 18 2017, 7:19 AM

It seems like median/p75/p95 are not changed on the fly,
see that they keep the same value for hours (https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts?refresh=5m&orgId=1&panelId=23&fullscreen&from=1505436334194&to=1505598247554):

The metrics on that graph are 24h moving averages, it would makes sense that they don't move instantly.

However, something is very wrong with that graph (at least over that period), as median p75 and p95 are identical, with or without the moving average. The explanation is seen when zooming out, it seems like the data mostly stopped and the brokenness is just some artifact:

Gilles added a subscriber: Krinkle.Sep 18 2017, 7:21 AM

And the report rate confirms it, it tanked:

The timing corresponds with @Krinkle's deployment.

Gilles renamed this task from Investigate overall increase in diff for firstPaint 2017-09-15 19:00:02 to Navigation Timing is broken.Sep 18 2017, 7:22 AM
Gilles triaged this task as Unbreak Now! priority.
Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. · View Herald TranscriptSep 18 2017, 7:22 AM

Maybe the ingestion point looks at the old EL schema? I'm going to check EL SQL data to see if it's working there or not.

SELECT COUNT(*) FROM `NavigationTiming_16305090` WHERE timestamp >= 20170912000000 AND timestamp < 20170913000000

361199

SELECT COUNT(*) FROM `NavigationTiming_17216284` WHERE timestamp >= 20170915000000 AND timestamp < 20170916000000

288187

Looks like the same order of magnitude, difference is probably due to weekday/weekend. This would suggest that Grafana/graphite is just looking at the old schema, which is still getting a tiny trickle of clients with outdated cached client side code referencing the old schema version.

Gilles assigned this task to Krinkle.Sep 18 2017, 8:00 AM
Peter added a comment.Sep 18 2017, 8:48 AM

The ideas was that we didn't need to change coal at at the same moment (pickup that new metrics) but we missed something, I'm not familiar how it exactly works.

Coal is fine, it's the EL/kafka intake that seems to be stuck on the old schema.

Krinkle added a comment.EditedSep 18 2017, 3:41 PM

Coal is fine, it's the EL/kafka intake that seems to be stuck on the old schema.

This is odd. We very intentionally only filter by schema name in our pipeline, not by revision, so that we always consider older versions without needing to fragment the handling at each layer. In addition, we aim that each version is backwards and forwards compatible (we only add optional fields and never tighten restrictions on existing fields).

Reporting of navtiming.fetchStart in Graphite went from 70/min to ~ 1/min:


https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&panelId=12&fullscreen&orgId=1&var-metric=fetchStart

Reporting of navtiming.mediaWikiLoadComplete went from 200/min to 30/min:
https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&panelId=12&fullscreen&orgId=1&var-metric=mediaWikiLoadComplete

Statistics from Kafka production directly, also impacted, albeit not nearly as much. Went from around 300/min to 250/min.


https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=NavigationTiming&from=now-7d&to=now

https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&from=now-7d&to=now&var-server=hafnium&var-network=eth0

Krinkle moved this task from Inbox to Doing on the Performance-Team board.Sep 18 2017, 4:11 PM

Change 378737 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@master] Unbreak navTiming reporting

https://gerrit.wikimedia.org/r/378737

Change 378740 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@wmf/1.30.0-wmf.18] Unbreak navTiming reporting

https://gerrit.wikimedia.org/r/378740

Change 378737 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Unbreak navTiming reporting

https://gerrit.wikimedia.org/r/378737

Change 378740 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@wmf/1.30.0-wmf.18] Unbreak navTiming reporting

https://gerrit.wikimedia.org/r/378740

Mentioned in SAL (#wikimedia-operations) [2017-09-18T21:18:47Z] <krinkle@tin> Synchronized php-1.30.0-wmf.18/extensions/NavigationTiming/modules/ext.navigationTiming.js: Unbreak - T176105 (duration: 00m 48s)

Krinkle closed this task as Resolved.Sep 18 2017, 10:03 PM
Krinkle removed a project: Patch-For-Review.