Update per-domain uniques fresh-sessions computation
Closed, Resolved · Public · 3 Estimated Story Points

Description

Currently, for per-domain uniques, the fresh-sessions computation (named "offset") counts fingerprinted sessions having made:

  • 1 request with no cookies set (nocookie IS NOT NULL)
  • 0 requests with some cookies set (nocookie IS NULL).

This way of computing the offset undercounts fresh sessions.
Counting only devices that made exactly 1 request with no cookies set (nocookie IS NOT NULL) is correct. However, additionally requiring that they made 0 other requests excludes devices whose "fresh" session includes more than 1 hit, which amounts to about 10% of the offset.
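
The difference between the current rule and the proposed one can be sketched in Python as follows. This is only an illustration of the counting logic described above, not the refinery job itself: the function name `fresh_session_offset`, the `(fingerprint, nocookie)` tuple layout, and the sample data are all hypothetical.

```python
from collections import Counter

def fresh_session_offset(requests, require_no_other_requests):
    """Count fingerprinted sessions that qualify as fresh sessions.

    requests: iterable of (fingerprint, nocookie) pairs (hypothetical layout).
    nocookie is None when the request carried cookies, non-None otherwise.
    require_no_other_requests: True mimics the current rule (0 requests
    with cookies allowed), False mimics the proposed rule.
    """
    nocookie_hits = Counter()
    cookie_hits = Counter()
    for fingerprint, nocookie in requests:
        if nocookie is not None:
            nocookie_hits[fingerprint] += 1
        else:
            cookie_hits[fingerprint] += 1

    count = 0
    for fingerprint, n in nocookie_hits.items():
        # Both rules require exactly 1 request with no cookies set.
        if n != 1:
            continue
        # Only the current rule also rejects sessions with further hits.
        if require_no_other_requests and cookie_hits.get(fingerprint, 0) != 0:
            continue
        count += 1
    return count

# Made-up sample: "a" is a 1-hit fresh session, "b" is a fresh session
# with 3 hits, "c" is a returning device that always sends cookies.
sample = [("a", 1), ("b", 1), ("b", None), ("b", None), ("c", None)]
current_offset = fresh_session_offset(sample, require_no_other_requests=True)
proposed_offset = fresh_session_offset(sample, require_no_other_requests=False)
```

In this sample the current rule counts only session "a", while the proposed rule also counts the multi-hit session "b".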

A fresh session with more than 1 hit has a first hit with nocookies=1, and that hit sets the last-access cookie to the current day: for example, if the current date is May 3rd, it sets the last-access cookie to May 3rd. The subsequent requests from that session carry a last-access date of May 3rd and thus are not counted towards the uniques of that day (only requests whose last-access date is earlier than the current date get counted).
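
A minimal simulation of that cookie behaviour, assuming the simplified rule stated above (a request counts only when its last-access date is earlier than the current date). The helper names and the year 2017 are assumptions made for the sake of the example.

```python
from datetime import date

def counts_toward_daily_uniques(last_access, today):
    """A request counts only if it carries a last-access date before today."""
    return last_access is not None and last_access < today

def simulate_fresh_session(n_hits, today):
    """Return one (nocookie, counted) pair per hit of a fresh session."""
    last_access = None  # a fresh session starts without the cookie
    hits = []
    for _ in range(n_hits):
        nocookie = last_access is None
        counted = counts_toward_daily_uniques(last_access, today)
        hits.append((nocookie, counted))
        last_access = today  # the response sets last-access to the current day
    return hits

# A fresh session with 3 hits on May 3rd (year chosen arbitrarily).
hits = simulate_fresh_session(3, date(2017, 5, 3))
# hits == [(True, False), (False, False), (False, False)]
```

None of the three hits is counted through the last-access date, so such a session can only be picked up by the offset, and only if the offset rule tolerates the later hits that carry cookies.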

Move to production:

  • Add a row in documentation about the change in this page

Event Timeline

Nuria set the point value for this task to 3. Jun 5 2017, 3:26 PM

Change 356823 had a related patch set uploaded (by Nuria; owner: Joal):
[analytics/refinery@master] Correct per-domain unique devices jobs

https://gerrit.wikimedia.org/r/356823

Milimetric triaged this task as Medium priority. Jun 22 2017, 3:09 PM

@JAllemandou Did the "about 10% of the offset" estimate in the task description refer to the daily metric?
For the monthly unique devices, the impact may have been much larger (looking at the total uniques_estimate - haven't examined the offset part separately yet):

enwiki total unique devices - year-over-year changes -April 2018.png (371×600 px, 16 KB)

(This is for enwiki, adding mobile and desktop - i.e. the number we have been tracking as a core metric in the monthly board metrics report until recently. There are other weird fluctuations here too, also when looking at mobile and desktop separately, which led to the conclusion that year-over-year comparisons for per-domain uniques should not be relied upon too much - certainly if they span the point in time where this fix was implemented in June 2017.)

The correction that this bug is about affects only the "offset" of the unique devices calculation, not the under_estimate (see the definition of the fix: it only affects fresh sessions).

It seems that your graph above is plotting "underestimate+offset". If you want to see the effect of this bug, the best way would be to graph the offset alone, before and after the correction, for per-domain uniques (daily or monthly).

The effect of the bugfix on the unique devices metric will depend on the percentage of the unique devices total that is derived from the offset.

The offset correction represents a higher percentage of uniques the longer the timespan is, so it is bigger for "monthly devices" than it is for "daily" ones.

See:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Unique_Devices/Last_access_solution#How_big_of_a_percentage_does_the_offset_represent_from_the_total?
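
To make the dependence on the offset share concrete, here is a small arithmetic sketch. It assumes the total is simply underestimate + offset, as in the plot discussed above; the function name and all numbers are made up for illustration and are not measured values.

```python
def total_relative_change(underestimate, offset, offset_relative_increase):
    """Relative change of (underestimate + offset) when only the offset grows.

    offset_relative_increase: e.g. 0.10 for a 10% larger offset.
    Algebraically this equals offset_share * offset_relative_increase,
    where offset_share = offset / (underestimate + offset).
    """
    total_before = underestimate + offset
    total_after = underestimate + offset * (1 + offset_relative_increase)
    return total_after / total_before - 1

# Hypothetical shares: the same 10% offset correction moves the total
# by about 1% when the offset is 10% of the total, and by about 4%
# when the offset is 40% of the total.
small_share = total_relative_change(900, 100, 0.10)
large_share = total_relative_change(600, 400, 0.10)
```

The same relative correction to the offset therefore moves the total more for timespans where the offset makes up a larger part of it.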

> The correction that this bug is about affects only the "offset" of the unique devices calculation, not the under_estimate (see the definition of the fix: it only affects fresh sessions).
>
> It seems that your graph above is plotting "underestimate+offset". If you want to see the effect of this bug, the best way would be to graph the offset alone, before and after the correction, for per-domain uniques (daily or monthly).

Well yes, I had already pointed that out above ("looking at the total uniques_estimate - haven't examined the offset part separately yet"; my plot was about the overall metric because that happened to be the data whose trends I was looking at). I have now made a plot: T169550#4729095 - it seems the fix increased the monthly offset part on enwiki by about 60%.