Page MenuHomePhabricator

Refactor NavigationTiming extension so that it can be used to oversample based on criteria
Closed, ResolvedPublic

Description

The NavigationTiming extension currently has hardcoded logic to oversample FirstPaint data from Asia - this is being used to collect extra data in preparation for Singapore going live, so that we can more clearly see the effect of this change. Collected data is submitted to a statsd metric name.

It would be useful to be able to oversample NavTiming data for additional criteria as well, and to be able to configure this without having to write new code each time. A specific example where this would have been useful is the release of FF57, as we would be able to collect a fairly large data set on actual end user performance relatively quickly, without polluting our general sample.

Things that are needed in ext.NavigationTiming.js:

  1. Enhance the Event object so that it includes a boolean value indicating whether it is an oversample
  2. Add a generic method that checks whether oversampling is enabled at all
  3. Add a method that checks whether Geo oversampling (country or region) is enabled
  4. Add a method that checks whether User-Agent oversampling is enabled
  5. Emit non-oversampled event if needed
  6. Emit oversampled event if needed. This does mean that certain events will be collected twice. They will end up in different locations, so this is okay.

Things that are needed in the NavigationTiming schema:

  1. Add the oversample boolean (default: false)

Things that are needed in webperf.py:

  1. Check the oversample boolean, if it exists, and direct the collected event appropriately.

Potential ramifications/drawbacks:

Depending on oversample rate, this could increase the number of items on the NavigationTiming queue in Kafka by quite a bit. Need to check with the Analytics team to ensure that this won't be an issue.

Checklist:

Event Timeline

Depending on oversample rate, this could increase the number of items on the NavigationTiming queue in Kafka by quite a bit. Need to check with the Analytics team to ensure that this won't be an issue.

Unless we are talking about 100s of events per sec this should be fine.

Maybe Wikimediaevents is of use to manage smaple ratios? https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents

Change 394298 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[mediawiki/extensions/NavigationTiming@master] ext.NavigationTiming: Allow oversampling based on geography or user agent

https://gerrit.wikimedia.org/r/394298

Change 394375 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] webperf.py: Handle oversamples differently than regular samples

https://gerrit.wikimedia.org/r/394375

Krinkle triaged this task as Medium priority.Dec 15 2017, 1:54 AM
Krinkle updated the task description. (Show Details)

Change 402867 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] modules/webperf: handle oversamples differently than regular samples

https://gerrit.wikimedia.org/r/402867

@Krinkle @Gilles Opened a new patchset with my changes to webperf.py, and to the test fixtures. Not sure what I did to my original branch, but when I tried to rebase everything went pear-shaped. Easier to just cherry-pick over and resubmit.

Timo, your comments from the original review have been addressed in this patchset.

Change 394375 abandoned by Imarlier:
webperf.py: Handle oversamples differently than regular samples

Reason:
I dunno what I did to my branch, but whatever it was broke everything. Opened a new patchset for these changes.

https://gerrit.wikimedia.org/r/394375

This will be awesome! I miss docs describing limits and how we do it (but we can add that when it's merged). Like how should we keep track of different oversamplings running, can you run multiple at a time, what should you think about etc.

This will be awesome! I miss docs describing limits and how we do it (but we can add that when it's merged). Like how should we keep track of different oversamplings running, can you run multiple at a time, what should you think about etc.

I was planning to add docs as soon as the change is merged.

There actually is an issue with having multiple oversample groups running at one time. Let's say that we were oversampling a country 'XX' at 1-out-of-100 requests, and oversampling a user agent 'Firefox 100' at 1-out-of-10. If more than 1% of the users in countries 'XX' were using the new browser, then our sample set from country 'XX' would include more users than it should (and those users are going to be more similar than they otherwise would be). That's worth thinking about, but for the moment I think I'm happy to punt on it.

(The right approach is probably to do a separate check for whether to oversample based on geo and on UA, and to indicate in the event why the oversample occurred - we can tag/aggregate based on that after receiving)

Change 402867 merged by Dzahn:
[operations/puppet@production] webperf: Handle oversamples differently than regular samples

https://gerrit.wikimedia.org/r/402867

Thanks, @Dzahn - verified in prod, all is well. Much appreciated!

Change 394298 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] ext.NavigationTiming: Allow oversampling based on geography or user agent

https://gerrit.wikimedia.org/r/394298

Change 416627 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/NavigationTiming@master] [WIP] Add integration test for oversampling feature

https://gerrit.wikimedia.org/r/416627

Change 416627 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] ext.NavigationTiming: Add integration test for oversampling feature

https://gerrit.wikimedia.org/r/416627