
Add cache response type and response size as new dimensions to navtiming_responsestart_by_host_seconds prometheus metric
Open, High, Public

Description

This will likely require merging the ServerTiming EventLogging schema into the NavigationTiming schema (https://meta.wikimedia.org/wiki/Schema:NavigationTiming), so that the navtiming daemon can get this data from the same record it currently reads the responseStart metric from.

Event Timeline

Gilles renamed this task from Add cache response type as a new dimension to response type by host prometheus navtiming metric to Add cache response type as a new dimension to navtiming_responsestart_by_host_seconds prometheus metric. Oct 8 2020, 6:52 AM
Gilles triaged this task as High priority.
Gilles created this task.
Gilles updated the task description.

Change 632879 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Fold cache response type data into NavigationTiming

https://gerrit.wikimedia.org/r/632879

Change 632883 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Add cache response type as dimension to per-host metric

https://gerrit.wikimedia.org/r/632883

Gilles renamed this task from Add cache response type as a new dimension to navtiming_responsestart_by_host_seconds prometheus metric to Add cache response type and response size as new dimensions to navtiming_responsestart_by_host_seconds prometheus metric. Oct 12 2020, 8:58 AM

@ema since these new dimensions are labels, for transfersize we're going to need to come up with buckets ourselves. What buckets would you be interested in tracking?

Looking at October traffic, these are the percentiles I'm seeing for transferSize from RUM data (2673027 samples), in bytes:

p10: 9478
p50: 19431
p75: 33920
p90: 61715
p95: 88658

ema added a comment. Oct 12 2020, 9:37 AM

@ema since these new dimensions are labels, for transfersize we're going to need to come up with buckets ourselves. What buckets would you be interested in tracking?

Based on the October percentiles you mentioned, I think it would be interesting to define the following buckets:

0 - 10k
10k - 20k
20k - 30k
30k - 60k
60k - inf
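
As a rough illustration of how the buckets proposed above could be turned into a label value, here is a minimal sketch. It is not the code from the navtiming patch: the names `BUCKET_BOUNDS`, `BUCKET_LABELS` and `transfer_size_bucket` are hypothetical, "10k" is assumed to mean 10,000 bytes (the thread does not say whether 1000 or 1024 is meant), and each bucket is assumed to include its lower bound and exclude its upper bound.

```python
import bisect

# Exclusive upper bounds of the finite buckets, in bytes.
# Assumption: "10k" is read as 10,000 bytes, not 10,240.
BUCKET_BOUNDS = [10_000, 20_000, 30_000, 60_000]

# One label per bucket; the last one is the open-ended "60k - inf" bucket.
# These label strings are hypothetical, not the ones used by navtiming.
BUCKET_LABELS = ["0-10k", "10k-20k", "20k-30k", "30k-60k", "60k-inf"]


def transfer_size_bucket(transfer_size):
    """Return the bucket label for a transferSize in bytes.

    Returns None when the value is missing or negative, so the caller
    can decide whether to skip the sample or use a fallback label.
    """
    if transfer_size is None or transfer_size < 0:
        return None
    # bisect_right counts how many bounds are <= transfer_size, which is
    # exactly the index of the bucket the value falls into.
    return BUCKET_LABELS[bisect.bisect_right(BUCKET_BOUNDS, transfer_size)]
```

Under these assumptions, the October percentiles quoted earlier would land in the following buckets: p10 (9478) in "0-10k", p50 (19431) in "10k-20k", p75 (33920) in "30k-60k", and both p90 (61715) and p95 (88658) in "60k-inf".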

Gilles moved this task from Inbox to Doing on the Performance-Team board. Oct 13 2020, 6:50 PM

Change 632883 merged by jenkins-bot:
[performance/navtiming@master] Add cache response type as dimension to per-host metric

https://gerrit.wikimedia.org/r/632883

Change 634228 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Add transfer size buckets as new dimension by host

https://gerrit.wikimedia.org/r/634228

Change 634228 merged by jenkins-bot:
[performance/navtiming@master] Add transfer size buckets as new dimension by host

https://gerrit.wikimedia.org/r/634228

Change 632879 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Fold cache response type data into NavigationTiming

https://gerrit.wikimedia.org/r/632879

The new cacheResponseType field is being collected correctly in Hive:

SELECT COUNT(*), event.cacheResponseType FROM event.navigationtiming WHERE year = 2020 AND month = 11 AND day = 30 GROUP BY event.cacheResponseType;

_c0	cacheresponsetype
104711	NULL
158735	hit-front
3508	hit-local
78574	miss
20267	pass

NULL responses are from browsers that don't support Server Timing.
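
To read the counts above as proportions, a small sketch like the following can help. The counts are copied from the query result; the names `counts` and `response_type_shares` are hypothetical, and leaving NULL rows out of the denominator is an assumption made here to look only at browsers that report Server Timing.

```python
# Counts copied from the Hive result above (2020-11-30).
# None stands for the NULL rows (no Server Timing support).
counts = {
    None: 104711,
    "hit-front": 158735,
    "hit-local": 3508,
    "miss": 78574,
    "pass": 20267,
}


def response_type_shares(counts):
    """Return each cache response type's share of the events that reported one.

    NULL rows are excluded from the denominator (an assumption of this sketch).
    """
    reported = {k: v for k, v in counts.items() if k is not None}
    total = sum(reported.values())
    return {k: v / total for k, v in reported.items()}


shares = response_type_shares(counts)
for response_type, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{response_type}: {share:.1%}")
```

With these numbers, hit-front accounts for about 61% of the events that reported a cache response type.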

And as expected the ServerTiming schema no longer collects data:

SELECT COUNT(*) FROM event.servertiming WHERE year = 2020 AND month = 11 AND day = 30;

_c0
10

Those 10 hits are probably stragglers from people with old JS cached (e.g. a frozen browser tab reawakened).

Change 644201 had a related patch set uploaded (by Gilles; owner: Gilles):
[analytics/refinery@master] ServerTiming has been folded into NavigationTiming

https://gerrit.wikimedia.org/r/644201

I've added cache response type to the per host dashboard: https://grafana-rw.wikimedia.org/d/M7xQ_BeWk/response-time-by-host

I can't manage to add transfer size to that dashboard. For some reason, the time-shifted graphs don't work with it. Maybe a Grafana bug?

Change 644201 merged by Mforns:
[analytics/refinery@master] ServerTiming has been folded into NavigationTiming

https://gerrit.wikimedia.org/r/644201