[Spike] What should our sampling strategy be for session_tick?
Closed, Resolved (Public)

Description

Context

Once session_tick tracking is rolled out to all wikis, the dataset will quickly become too large if we don't institute sampling. Details are in Marcel's comment here: https://phabricator.wikimedia.org/T271455#6741794

As we sample, we need to consider how representative the data is for key breakdowns (e.g. wiki, country, day). Is there a minimum threshold of sessions we should set? What do we know, or what can we learn, about activity within key breakdowns to inform our sampling strategy?

Decision

Initial rollout: 1/100 sampling; we will check the volume of events coming in. If the volume is within projected levels, we will move to 1/10 sampling.
If needed, we will adjust sampling on a per-wiki basis once we have a better sense of event volume.
See comments T272069#6763490 and T272069#6801863 for additional details.

Event Timeline

kzimmerman renamed this task from [Spike] When should we sample? to [Spike] What should our sampling strategy be for session_tick?. Jan 14 2021, 7:30 PM
kzimmerman removed kzimmerman as the assignee of this task.
kzimmerman removed a project: Goal.
kzimmerman updated the task description.
LGoto triaged this task as Medium priority.
LGoto moved this task from Triage to Needs Investigation on the Product-Analytics board.
mpopov added a subscriber: mforns.

Based on @mforns's excellent investigation of the (in)accuracy that each sampling rate yields:

[Attached image: sessiom length sampling study.png (220 KB)]

This study suggests that 10% of sessions would be an acceptable sampling rate, but it is based only on the initial rollout to group0 and group1 wikis.
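
As a minimal sketch (assuming a hypothetical heavy-tailed distribution of session lengths, not the actual study data), here is how one could simulate the accuracy loss at different sampling rates:

```python
# Illustrative simulation, NOT the actual study: how the relative error
# of the median session length grows as the sampling rate drops.
# Session lengths are drawn from a hypothetical heavy-tailed
# distribution; real per-wiki traffic will differ.
import random
import statistics

random.seed(42)

# Hypothetical population of session lengths, in ticks.
population = [int(30 * random.paretovariate(1.5)) for _ in range(100_000)]
true_median = statistics.median(population)

for rate in (0.5, 0.1, 0.01, 0.001):
    errors = []
    for _ in range(20):  # repeat to average out sampling noise
        sample = [x for x in population if random.random() < rate]
        estimate = statistics.median(sample)
        errors.append(abs(estimate - true_median) / true_median)
    print(f"rate={rate:>6}: mean relative error ~ {statistics.mean(errors):.1%}")
```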

Stream configurations enable us to adjust sampling rates rapidly through MediaWiki config deployments, which are much more frequent than code releases. We should start with a conservative rate like 1% (0.01 in the config) when we deploy to all wikis and work from there. By looking at the volume of events and data at that rate, we will be able to decide whether to increase it for some wikis or make it even smaller for others (e.g. 0.1% for enwiki).
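
For illustration only, the shape of such a per-stream sampling setting might look like the sketch below; the exact key names are assumptions, and the authoritative config lives in MediaWiki configuration:

```python
# Hypothetical sketch of a per-stream sampling setting; the real
# setting is deployed via MediaWiki configuration and its exact key
# names may differ.
STREAM_CONFIG = {
    "mediawiki.client.session_tick": {
        # 0.01 == 1% of sessions. Because this is config rather than
        # code, changing it only needs a config deployment.
        "sample": {"unit": "session", "rate": 0.01},
    },
}
```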

Also! Since the underlying sampling algorithm uses decimals like 0.5 and 0.25 instead of fractions like "1 in 2" or "1 in 4", we can even arrive at "weird" sampling rates like 17% or 34% by gradually increasing the sampling rate in increments of 0.5% or 1% and seeing how the system handles the resulting changes.
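
To make the decimal-rate point concrete, here is a hedged sketch (not the actual client code) of session-scoped sampling: derive a uniform number in [0, 1) from the session token and keep the session whenever that number is below the configured rate. Any decimal rate, including 0.17, works the same way:

```python
import hashlib

def in_sample(session_token: str, rate: float) -> bool:
    """Illustrative session-scoped sampling decision.

    The same token always yields the same decision, so every tick in a
    session is kept or dropped as a unit. The hashing scheme here is an
    assumption, not the actual client implementation.
    """
    digest = hashlib.sha256(session_token.encode()).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < rate

# Decimal rates allow arbitrary values, e.g. 17%:
print(in_sample("example-session-token", 0.17))
```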

Side note for @Ottomata and @jlinehan: this might be a great opportunity to do a stress test of the system?

The resolution here was to start with a conservative rate (1%) and go up from there. However, Marcel added a comment that the Analytics Engineering team believes the pipeline could take up to 1/10 of all session tick data:

I discussed with the Analytics team how much session tick data our pipeline could possibly take, given that high sampling rates result in very low accuracy for our metric.
The conclusion we arrived at is that the pipeline could take up to 1/10 of all session tick data, theoretically without problems, as it is today.
If necessary, we could consider collecting all data un-sampled, but that would probably require some extra work, e.g. scaling EventGate, making sure the session tick stream does not starve smaller streams when importing and processing the data, and setting up a specific deletion schedule to avoid keeping TBs of data.

Given the above, I believe we should start with 1/10 sampling and adjust as needed.
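
For context, a back-of-the-envelope projection of daily volume at each candidate rate could look like the sketch below; every input number is a hypothetical placeholder, not measured Wikimedia traffic:

```python
# Hypothetical inputs; replace with real measurements before deciding.
SESSIONS_PER_DAY = 200_000_000   # placeholder: global sessions per day
TICKS_PER_SESSION = 10           # placeholder: average ticks per session
BYTES_PER_EVENT = 600            # placeholder: serialized event size

for rate in (0.01, 0.1, 1.0):    # 1/100, 1/10, unsampled
    events = SESSIONS_PER_DAY * TICKS_PER_SESSION * rate
    gigabytes = events * BYTES_PER_EVENT / 1e9
    print(f"rate {rate:>4}: {events:,.0f} events/day, ~{gigabytes:,.0f} GB/day")
```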

I still think we can take a 1/10 sampling rate, but nevertheless it would be nice to start at 1/100, just for 1 or 2 days, to be safe... If my projections are *not* correct, this might have the potential to collapse parts of the data collection/processing system. But, as @mpopov said, the good news is that we can change the production sampling rate with just a MediaWiki-config change; we don't need to wait for a full MediaWiki deployment train. So, tl;dr: yes 1/10, but let's roll it out progressively.

Thanks for the clarification, @mforns! I'll update the task description with a brief summary.

kzimmerman updated the task description.