[Spike] What should our sampling strategy be for session_tick?
Closed, Resolved (Public)

Description

Context

Once session_tick tracking is rolled out to all wikis, the dataset will quickly become too large if we don't institute sampling. Details are in Marcel's comment here: https://phabricator.wikimedia.org/T271455#6741794

As we sample, we need to consider how representative the data is for key breakdowns (e.g. wiki, country, day). Is there a minimum threshold of sessions we should set? What do we know, or what can we learn, about activity within key breakdowns to inform our sampling strategy?

Decision

Initial rollout: 1/100 sampling; we will check the volume of events coming in. If the volume is within projected levels, we will move to 1/10 sampling.
If needed, we will adjust sampling on a per-wiki basis once we have a better sense of event volume.
See comments T272069#6763490 and T272069#6801863 for additional details.

Event Timeline

kzimmerman renamed this task from [Spike] When should we sample? to [Spike] What should our sampling strategy be for session_tick?. Jan 14 2021, 7:30 PM
kzimmerman removed kzimmerman as the assignee of this task.
kzimmerman removed a project: Goal.
kzimmerman updated the task description.
LGoto triaged this task as Medium priority.
LGoto moved this task from Triage to Needs Investigation on the Product-Analytics board.
mpopov added a subscriber: mforns.

Based on @mforns's excellent investigation of the (in)accuracy that each sampling rate yields:

[Attached image: sessiom length sampling study.png (220 KB)]

This study suggests that 10% of sessions would be an acceptable sampling rate, but it is based only on the initial rollout to group0 and group1 wikis.
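
As a minimal sketch (assuming a hypothetical heavy-tailed distribution of session lengths, not the actual study data), here is how one could simulate the accuracy loss at different sampling rates:

```python
# Illustrative simulation, NOT the actual study: how the relative error
# of the median session length grows as the sampling rate drops.
# Session lengths are drawn from a hypothetical heavy-tailed
# distribution; real per-wiki traffic will differ.
import random
import statistics

random.seed(42)

# Hypothetical population of session lengths, in ticks.
population = [int(30 * random.paretovariate(1.5)) for _ in range(100_000)]
true_median = statistics.median(population)

for rate in (0.5, 0.1, 0.01, 0.001):
    errors = []
    for _ in range(20):  # repeat to average out sampling noise
        sample = [x for x in population if random.random() < rate]
        estimate = statistics.median(sample)
        errors.append(abs(estimate - true_median) / true_median)
    print(f"rate={rate:>6}: mean relative error ~ {statistics.mean(errors):.1%}")
```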

Stream configurations enable us to adjust sampling rates rapidly through MediaWiki config deployments, which are much more frequent than code releases. We should start with a conservative rate like 1% (0.01 in the config) when we deploy to all wikis and work from there. By looking at the volume of events and data at that rate, we will be able to decide whether to increase it for some wikis or make it even smaller for others (e.g. 0.1% for enwiki).
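
For illustration only, the shape of such a per-stream sampling setting might look like the sketch below; the exact key names are assumptions, and the authoritative config lives in MediaWiki configuration:

```python
# Hypothetical sketch of a per-stream sampling setting; the real
# setting is deployed via MediaWiki configuration and its exact key
# names may differ.
STREAM_CONFIG = {
    "mediawiki.client.session_tick": {
        # 0.01 == 1% of sessions. Because this is config rather than
        # code, changing it only needs a config deployment.
        "sample": {"unit": "session", "rate": 0.01},
    },
}
```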

Also! Since the underlying sampling algorithm uses decimals like 0.5 and 0.25 instead of fractions like "1 in 2" or "1 in 4", we can even arrive at "weird" sampling rates like 17% or 34% by gradually increasing the sampling rate in increments of 0.5% or 1% and seeing how the system handles the resulting changes.
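
To make the decimal-rate point concrete, here is a hedged sketch (not the actual client code) of session-scoped sampling: derive a uniform number in [0, 1) from the session token and keep the session whenever that number is below the configured rate. Any decimal rate, including 0.17, works the same way:

```python
import hashlib

def in_sample(session_token: str, rate: float) -> bool:
    """Illustrative session-scoped sampling decision.

    The same token always yields the same decision, so every tick in a
    session is kept or dropped as a unit. The hashing scheme here is an
    assumption, not the actual client implementation.
    """
    digest = hashlib.sha256(session_token.encode()).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < rate

# Decimal rates allow arbitrary values, e.g. 17%:
print(in_sample("example-session-token", 0.17))
```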

Side note for @Ottomata and @jlinehan: this might be a great opportunity to do a stress test of the system?

The resolution here was to start with a conservative rate (1%) and go up from there. However, Marcel added a comment that the Analytics Engineering team believes the pipeline could take up to 1/10 of all session tick data:

I discussed with the Analytics team how much session tick data our pipeline could possibly take, given that high sampling rates result in very low accuracy for our metric.
The conclusion we arrived at is that the pipeline could take up to 1/10 of all session tick data, theoretically without problems, as it is today.
If necessary, we could consider collecting all data un-sampled, but that would probably require some extra work, e.g. scaling EventGate, making sure the session tick stream does not starve smaller streams when importing and processing the data, and setting up a specific deletion schedule to avoid keeping TBs of data.

Given the above, I believe we should start with 1/10 sampling and adjust as needed.
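
For context, a back-of-the-envelope projection of daily volume at each candidate rate could look like the sketch below; every input number is a hypothetical placeholder, not measured Wikimedia traffic:

```python
# Hypothetical inputs; replace with real measurements before deciding.
SESSIONS_PER_DAY = 200_000_000   # placeholder: global sessions per day
TICKS_PER_SESSION = 10           # placeholder: average ticks per session
BYTES_PER_EVENT = 600            # placeholder: serialized event size

for rate in (0.01, 0.1, 1.0):    # 1/100, 1/10, unsampled
    events = SESSIONS_PER_DAY * TICKS_PER_SESSION * rate
    gigabytes = events * BYTES_PER_EVENT / 1e9
    print(f"rate {rate:>4}: {events:,.0f} events/day, ~{gigabytes:,.0f} GB/day")
```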

I still think we can take a 1/10 sampling rate, but nevertheless it would be nice to start at 1/100, just for 1 or 2 days, to be safe... If my projections are *not* correct, this might have the potential to collapse parts of the data collection/processing system. But, as @mpopov said, the good news is that we can change the production sampling rate with just a MediaWiki-config change; we don't need to wait for a full MediaWiki deployment train. So, tl;dr: yes 1/10, but let's roll it out progressively.

Thanks for the clarification, @mforns! I'll update the task description with a brief summary.

kzimmerman updated the task description.