
Define key metrics (3-5 total) that we're tracking and go public with the metrics
Open, High, Public

Description

Decide on a small set of key performance metrics (3-5 total) that are well-defined by us and that we will actively track and report about on a regular basis.

Ideas for what those metrics might be:

  • Frontend Save Timing (RUM)
  • Backend Save Timing (RUM)
  • SpeedIndex (Synthetic)
  • firstVisualChange (Synthetic)
  • lastVisualChange (Synthetic)
  • TTFB (RUM)
  • firstPaint (RUM)
  • domComplete (RUM)
  • loadEventEnd (RUM)
  • Total bytes transferred (Synthetic)

Questions about Save Timing:

  • Bots, humans, both?
    • <Peter> If it's easy I think we should focus on humans. – T187684#4037780

Questions for RUM metrics:

  • Aggregation:
    • Which aggregate do we use? (median, p75, p99, max)
      • <Peter> Not max, right, because it is polluted. Median and one high percentile. – T187684#4037780
    • What timespan do we aggregate on? (weekly, monthly, quarterly, ..)
    • How do we compute the aggregates?
      • <Krinkle> From Graphite we can only use per-minute values from the past 7 days in a meaningful way; anything older, over longer periods, is statistically meaningless. We'll need to compute it ad-hoc from EventLogging raw data (MySQL/Hadoop). This is definitely doable, and we've done this in the past with help of @Tbayer (Analyst for Reading).
  • Fragmentation:
    • Platform (mobile, desktop), both, or overall?
    • Country, pick a few or overall?
    • Device connection, e.g. 2g, 3g, 4g+desktop, or overall?
    • Browser, pick a few or overall?
    • Authenticated vs anonymous users.

Questions for synthetic metrics:

  • <Peter> Do we use data from WebPageTest or WebPageReplay+browsertime?
  • <Krinkle> How do we account for changes to the page itself?
    • Metrics that are affected by below-the-fold content ("Total bytes transferred", domComplete, loadEventEnd) may vary a lot due to something as simple as the addition or removal of a video, for example. We may want to consider creating a copy of our test subject pages somehow and test those instead.
    • <Peter> I'm not too keen on that idea; it's like synthetically testing a synthetic page :) I think if we test three pages, we talk about all of them; if we see a diff in all three, it is a valid diff, the same way as we do today in alerts.

Notes:

  • We should include RUM data from authenticated users so we don't forget to focus on them.

Event Timeline

We need to report them in different buckets, one for RUM and one for synthetic testing, and also split them by desktop/mobile. The synthetic testing is two different kinds too: WebPageTest and WebPageReplay. With the first we measure what it looks like for the user in specific conditions; with the latter we try to measure regressions in our code.

We have two problems with synthetic testing: the numbers are very dependent on the page we test, and the numbers we get from WebPageTest have high variance, so which numbers should we report?

The problem we have with RUM is that we don't keep it in separate buckets, so for example increases/decreases in our metrics could depend on user characteristics.

For RUM we have been focused on First Paint and Save Timing, and I think we should continue with that (even though First Paint is really Chrome-focused).

For synthetic: First Visual Change and Speed Index. I like first visual change because it identifies when something is happening, and Speed Index because it calculates when the viewport is ready.

One metric that I like that we aren't using today is last visual change minus first visual change = how long it takes for the viewport to be ready, a kind of quality metric, but it's also really dependent on which page we test.
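
To make that derived metric concrete, here is a minimal TypeScript sketch that computes the difference from a browsertime-style visual metrics result; the field names are assumptions about the output shape rather than a guaranteed schema.

```typescript
// Minimal sketch (assumed field names, not a guaranteed browsertime schema):
// derive "viewport settling time" as last visual change minus first visual change.
interface VisualMetrics {
  FirstVisualChange: number; // ms from navigation start
  LastVisualChange: number;  // ms from navigation start
  SpeedIndex: number;        // ms
}

function viewportSettlingTime(vm: VisualMetrics): number {
  // How long the viewport keeps changing after something first appears.
  return vm.LastVisualChange - vm.FirstVisualChange;
}

// Illustrative values for one run of a hypothetical test page.
const run: VisualMetrics = { FirstVisualChange: 900, LastVisualChange: 2300, SpeedIndex: 1400 };
console.log(`Viewport settling time: ${viewportSettlingTime(run)} ms`);
```

As noted above, the value is only comparable between runs of the same page, since it depends heavily on the page content.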

I can't find the definition of last visual change, does it include things below the fold?

Krinkle renamed this task from Define a specific set of metrics (3-5 total) that we're tracking and go public with the metrics to Define key metrics (3-5 total) that we're tracking and go public with the metrics.Feb 19 2018, 9:22 PM
Krinkle triaged this task as High priority.Feb 19 2018, 9:25 PM

I can't find the definition of last visual change, does it include things below the fold?

No, only within the viewport.

We shouldn't be afraid to change this later on (we can revisit it once a year or so).

I think one important aspect is that if we find a regression we want to be able to explain it.

I'll try to get the discussion going. IMHO:

For navtiming metrics, should we use RUM, Synthetic, or report on both?

Navtiming for RUM, Visual Metrics (first visual change, speed index) for synthetic.

Which aggregate to use (for RUM)? (median, p75, p99, max)

Not max, right, because it is polluted? Median and one high percentile?

Should we filter? (e.g. human edits only, for save timing)

If it's easy I think we should focus on humans.

Should we fragment? (e.g. separate the RUM metrics by device: mobile/desktop, or by device connection: 2g, 3g, 4g+desktop)

Since they have different content, I think the minimum fragmentation is mobile/desktop. The same for synthetic.

Should we specialise (e.g. have a synthetic-like metric from RUM limited to just a specific combination of country, connection, and wiki, or something)?

I like that idea, but how stable will the metrics be if we do that? Or should we increase the sample rate for that combination?

Authenticated vs anonymous users. We should include authenticated so we don't forget to focus on those users.

Let us test that the first time we get the metrics and look at how stable the authenticated numbers are.

I think the assumption of picking a specific combination of factors for RUM would be that we oversample that particular set. We have to be mindful of privacy implications on small wikis when we do that, though. I think doing this would also let us pick a more advanced RUM metric only available in some browsers, which, while it wouldn't represent the experience on all browsers, could be slightly closer to what users experience.
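
A rough sketch of what that oversampling could look like on the client, assuming a hypothetical per-segment sampling configuration; the segment keys, rates, and target values below are illustrative, not an existing config.

```typescript
// Hypothetical per-segment sampling for RUM beacons; all names and rates are illustrative.
interface Segment {
  platform: 'desktop' | 'mobile';
  effectiveConnectionType?: string; // e.g. from navigator.connection.effectiveType, where supported
  country?: string;                 // assumed to come from server-side geolocation
}

const DEFAULT_SAMPLE_RATE = 1 / 1000; // baseline rate (illustrative)
const OVERSAMPLED_RATE = 1 / 50;      // boosted rate for the segment of interest (illustrative)

function shouldSample(segment: Segment): boolean {
  const isTargetSegment =
    segment.platform === 'mobile' &&
    segment.effectiveConnectionType === '3g' &&
    segment.country === 'XX'; // hypothetical target country placeholder
  const rate = isTargetSegment ? OVERSAMPLED_RATE : DEFAULT_SAMPLE_RATE;
  return Math.random() < rate;
}
```

Even with a boosted rate, the privacy concern above still applies: on small wikis an oversampled segment may contain too few users to report on safely.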

RUMSpeedIndex is still stuck in code review, but I would be curious to see how it behaves when used this way (for example looking at only Chrome, on mobile 3g speeds in a particular country), when compared to more basic metrics like first contentful paint.

I think we can push RUMSpeedIndex; what I would also like is if we could pick it up in WebPageTest. I think we can just add a global JS script on the WebPageTest server so it will always test it? I can have a go later today.

Krinkle updated the task description. (Show Details)
Krinkle added a subscriber: Tbayer.
Krinkle added a subscriber: Krinkle.

How do we compute the aggregates?

From Graphite we can only use per-minute values from the past 7 days in a meaningful way; anything older, over longer periods, is statistically meaningless. We'll need to compute it ad-hoc from EventLogging raw data (MySQL/Hadoop). This is definitely doable, and we've done this in the past with help of @Tbayer.
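
As a minimal sketch of that ad-hoc computation, assuming the raw samples have already been extracted from EventLogging into a plain array (the query step is out of scope here):

```typescript
// Compute the median and a high percentile ad-hoc from raw per-event samples,
// e.g. firstPaint values for one month. Nearest-rank method; an interpolated
// definition would also be reasonable.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error('No samples');
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
}

const firstPaintSamples = [850, 920, 1010, 1200, 1500, 2100, 3400]; // illustrative values
console.log('median:', percentile(firstPaintSamples, 50)); // 1200
console.log('p75:', percentile(firstPaintSamples, 75));    // 2100
```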

I talked with @Imarlier yesterday and the way forward is to just decide on the metrics to start with, and then we can continue to redefine them. I'll make a first proposal and you all @Krinkle @Gilles @aaron @Imarlier can edit/change or make suggestions:

The report

For the first report, we could actually report the exact metrics we have today and explain them. That could be a nice first post/report.

Synthetic testing

  • We test three pages on desktop and three on emulated mobile, with one connectivity profile. On desktop we test on Chrome and Firefox. We only test anonymous users and first views as a first step.
  • We report first visual change and last visual change (and keep track of Speed Index, but focus on the first two metrics because they are easier to explain/understand).
  • We focus on changes that are trackable across the three URLs. We explain changes in the environment since the last tests (new browser versions, bugs, etc.).

RUM

  • We report first paint for anonymous/logged-in users for desktop and mobile. We report the median and p?
  • Save timing (divided into frontend vs backend time?)
  • We collect TTFB, domComplete and loadEventEnd, keep them in the report for historical reasons, but do not focus on them for now.

Then we can focus on how we want to split things in the future.

I think that for Save Timing, frontend is enough to talk about. That's what users experience; backend timing is useful to us for sanity checks or to track down the source of a problem, but it's not something meaningful to the user.

For the past two weeks, I keep circling back to the same questions: What metrics should we report and compare? Which metrics matter most due to a direct and simple correlation to user experience?

This is mostly in the context of measuring performance on each commit in Jenkins with Fresnel (T133646), but also during code review when trying to justify why a certain component should or shouldn't undergo a certain change for performance reasons.

There are a lot of metrics available on the web platform today. Many of them are useful in some cases. But many of them can also be deceptive or counter-productive. I've tried to go over a bunch of metrics today in an attempt to narrow down a handful that I believe matter most for front-end development, or perhaps better categorised as page views (a collection sketch follows the list below).

  1. Navigation Timing: loadEventEnd
    • What: The initial load of a web page is fully completed. Implemented by browsers as the time when the navigation resource (HTML response) and all directly chained sub-resources (css, images, initial JS requests) have been completely downloaded, parsed and executed.
    • User experience: This is primarily observed by users through the native progress indicator. It also indirectly impacts user experience by being a good proxy for some of the sub-resources that also directly impact visual completion metrics. The "load event" includes CSS and initial images, which are required for visual rendering to start.
    • When: Production: yes, from real users and from synthetic tests. Continuous integration: yes.
  2. Paint Timing (RUM) / Visual metrics (Synthetic)
    • What: The visual appearance of the web page. In RUM metrics, this corresponds today to the "first paint", and the "first contentful paint" (paint with text or image from body content, e.g. not css layout or icons). For synthetic testing, this would correspond to "first visual change" and "last visual change".
    • User experience: When the page is visually complete, anything the user might be waiting for is ready. They might start interacting sooner, but it's unlikely that there would be a reason within the realm we control for them to wait after the visual rendering is complete. The "last visual change" is important to separate from the "load event" because asynchronous JavaScript can (although it should not) cause the page appearance to change. For example, banner insertions happen late, which may be disruptive. Or JavaScript may modify parts of the page in ways that were meant to be identical to the server/CSS rendering, but became different.
    • When: Production: yes, from real users and from synthetic tests. Continuous integration: yes.
  3. Resource Timing
    • What: The network transfer size, as captured in entry.transferSize (spec, mdn; not encodedBodySize), broken down by type (navigation: HTML, link: stylesheets, scripts, images, etc.)
    • User experience: Consuming a user's bandwidth has a direct cost that they may experience (financial cost, and battery use). It is also an indirect indicator of page load time. For example, when requests can be responded to from a local cache, many benefits are had:
      • Latency: the response is immediate, without the network.
      • Avoid parse and compilation time: browsers can re-use the binary result of compiling JavaScript code; Chrome/V8 do this based on URL and source checksum. This means that, just like how executing the same function gets faster over time (due to compiler optimisations), this is true across page views as well.
      • Reduce execution time: shorter scripts and stylesheets tend to take less time to parse/compile/execute than longer ones.
    • When: Production: yes, from real users and from synthetic tests. Continuous integration: yes.
  4. Long Tasks (RUM)
    • What: Detect when JavaScript code executes for a long time without interruption.
    • User experience: User input (scroll, select, type, click, hover, etc.) and animation frames may need a response from JavaScript callbacks. These are delayed (causing "jank") if unrelated code is executing. Registration of modules should be cheap, with the actual work delayed until it is needed, performed in chunks that can be interrupted if needed (for input handling), and able to execute back-to-back when interruption is not needed. See also https://w3c.github.io/longtasks/#intro.
    • When:
      • Production: yes, from real users. Maybe from synthetic as well, although synthetic has similar metrics such as "Time to Interactive" (which measures the first sequence of 5s without long task warnings).
      • CI: Yes.
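
As referenced above, here is a sketch of how these four families could be collected in the browser using standard APIs (Navigation Timing Level 2, Paint Timing, Resource Timing, and Long Tasks via PerformanceObserver). It is an illustration, not our actual instrumentation; support varies by browser (paint and longtask entries are Chrome-centric at the time of writing), and the synthetic visual metrics (first/last visual change) come from video analysis rather than a browser API, so they are not shown.

```typescript
// Illustration only: collecting the four metric families with standard browser APIs.

// 1. Navigation Timing: loadEventEnd (ms relative to navigation start).
window.addEventListener('load', () => {
  const [nav] = performance.getEntriesByType('navigation') as PerformanceNavigationTiming[];
  if (nav) {
    console.log('loadEventEnd (ms):', nav.loadEventEnd);
  }

  // 3. Resource Timing: transferred bytes, broken down by initiator type.
  const bytesByType: Record<string, number> = {};
  for (const res of performance.getEntriesByType('resource') as PerformanceResourceTiming[]) {
    bytesByType[res.initiatorType] = (bytesByType[res.initiatorType] || 0) + res.transferSize;
  }
  console.log('transferred bytes by initiator type:', bytesByType);
});

// 2. Paint Timing: first-paint and first-contentful-paint.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(entry.name, '(ms):', entry.startTime);
  }
}).observe({ entryTypes: ['paint'] });

// 4. Long Tasks: main-thread work blocking for more than 50 ms.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log('long task (ms):', entry.duration);
  }
}).observe({ entryTypes: ['longtask'] });
```

In real RUM code the values would be attached to an EventLogging-style beacon rather than logged, and sampled as discussed earlier in this task.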

What do you think? Do these four metrics capture everything? What other metrics directly affect user experience? Could one of the above be replaced with something better?

There are a few other metrics I considered, but thought weren't as good:

  • Navigation Timing: domInteractive. – This is when the browser has finished parsing the HTML. I don't think there is any direct correlation here to user experience. Visual completion could, and sometimes does, complete before this. Browsers perform "stream parsing" on HTML and don't depend on below-the-fold HTML to render. Some JavaScript code does relate to the dom-ready event (which fires around the same time as domInteractive), but as long as their callbacks don't become "long tasks" and don't make visual changes (which the above metrics capture), then it doesn't matter whether they run early or late.
  • Navigation Timing: domComplete. – Nearly identical to loadEventEnd, except it fires slightly before the indicator is complete, as it doesn't cover the window.onload callback itself. Useful to know when the sub-resources complete, but not a key metric I think.
  • Navigation Timing: responseStart (TTFB). – Incredibly useful, and should be a key metric I think. But, I think we should distinguish between "backend" metrics and "frontend" metrics. This would be a backend metric (together e.g. with save timing). I'm focussing on frontend at the moment.
  • Resource Timing: Number of requests. – Ambiguous. Sometimes fewer requests perform better, sometimes more do.
    • Bandwidth efficiency is captured by transfer size already, which would improve if fewer requests were more efficient.
    • Processing efficiency is captured by loadEventEnd already. E.g. a larger request may take more time to parse before execution can begin. Splitting a request might allow the server and browser to process two requests in parallel.
    • Execution cost is captured by Long Tasks. If a combined response contains (unrelated) code that all performs some work during initialisation, it might end up together causing jank.
kchapman moved this task from Next: Goal-oriented to Inbox on the Performance-Team board.

Let's call them Wikipedia Vitals :)