
Investigation of QS sampling capabilities [16h]
Closed, Resolved | Public | Spike

Description

This task covers an analysis of how QuickSurveys sampling works.

User Story
As a data analyst I want to be able to send the survey to a sample of users so that I can get statistically viable results.

What we know
We want to show the survey to a random sample of logged-in users who have made X edits in the last Y months at the time of sampling.

  • QuickSurveys samples all users by percentage first, and only then checks logged-in status and the other audience criteria to decide whether to show the survey. For a given survey with a sampling rate of 0.4, 40% of all users are bucketed for the survey; of that 40%, only those who match the other criteria, such as being logged in, actually see it.
  • Sampling is done based on session tokens, which are semi-permanent and let us get a percentage of device sessions, rather than a percentage of pageviews
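The two-stage behaviour described above (bucket by session token first, then apply audience criteria) can be sketched roughly as follows. This is an illustration only, not the QuickSurveys implementation: the function names, the `samplingRate` property, and the `user` object shape are hypothetical, and the FNV-1a hash is just a stand-in for whatever token-to-bucket mapping the extension actually uses.

```javascript
// Hypothetical sketch: map a session token to a stable number in [0, 1).
// FNV-1a is used purely for illustration.
function tokenToUnitInterval(token) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    hash ^= token.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash / 0x100000000;
}

// Stage 1: bucket by sampling rate, regardless of who the user is.
function isInSample(token, samplingRate) {
  return tokenToUnitInterval(token) < samplingRate;
}

// Stage 2: audience criteria, checked only for users already in the sample.
// Assumed `user` shape: { isAnon: boolean, editCount: number }.
function matchesAudience(user, audience) {
  if (audience.anons !== undefined && audience.anons !== user.isAnon) {
    return false;
  }
  if (audience.minEdits !== undefined && user.editCount < audience.minEdits) {
    return false;
  }
  if (audience.maxEdits !== undefined && user.editCount > audience.maxEdits) {
    return false;
  }
  return true;
}

function shouldShowSurvey(token, user, survey) {
  return isInSample(token, survey.samplingRate) &&
    matchesAudience(user, survey.audience);
}
```

Because the same token always maps to the same number, a device session stays in or out of the sample across pageviews. The share of users who actually see the survey is at most the sampling rate, and usually lower once the audience filter is applied.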

Open Questions

  • Can we sample based on logged-in status?
    • ✅ Yes, configuring the audience like so:
      'audience' => [
          'anons' => false,
      ]
  • Can we sample based on edit count?
    • ✅ Yes, configuring the audience like so:
      'audience' => [
          'minEdits' => 0,
      ]
  • Can we sample based on a time frame for a number of edits?
    • Not easily
    • Investigation in progress... ⏳
    • Alternative criteria that could proxy this requirement:
      • getLatestEditTimestamp() *
        • We could expose the last time a user edited and filter users based on that and minEdits
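The proxy idea in the last bullet (latest edit time combined with minEdits) could look something like the sketch below. Everything here is hypothetical: the `latestEditTimestamp` field stands for a value that would first have to be exposed to the client, and `maxAgeDays` is an invented criterion, not an existing QuickSurveys option.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical proxy for "X edits in the last Y months": the user has at
// least `minEdits` lifetime edits AND their latest edit is no older than
// `maxAgeDays`. It does NOT count how many edits fall inside the window.
// Assumed `user` shape: { editCount: number, latestEditTimestamp: string|null }
// where the timestamp is an ISO 8601 string.
function isRecentlyActiveEditor(user, criteria, now = Date.now()) {
  if (user.editCount < criteria.minEdits) {
    return false;
  }
  if (user.latestEditTimestamp === null) {
    return false;
  }
  const ageMs = now - Date.parse(user.latestEditTimestamp);
  return ageMs >= 0 && ageMs <= criteria.maxAgeDays * MS_PER_DAY;
}
```

The limitation is worth keeping in mind: a user with many old edits and a single edit yesterday passes this check, so it only approximates the "X edits in the last Y months" requirement.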

Event Timeline

Madalina renamed this task from Investigation of OS sampling capabilities to Investigation of QS sampling capabilities. (Sep 21 2021, 3:48 PM)
Madalina renamed this task from Investigation of QS sampling capabilities to Investigation of QS sampling capabilities [16h]. (Sep 22 2021, 2:58 PM)

@Jhernandez My understanding of the answer to these questions as of now is:

  • Can we sample based on logged-in status? Yes (set anons to false)
  • Can we sample based on edit count? Yes (set minEdits and maxEdits)
  • Can we sample based on a time frame for a number of edits? No

Does that look right to you?

I'd imagine that would mean the next step for this task would be investigating how hard it would be to sample based on recent edits within a time frame as opposed to all time edits.

All sound right to me 👍

I went investigating yesterday, and besides the currently used wgUserEditCount in mw.config I couldn't find anything else readily available related to what they call "active users" (users who made X edits in the last Y months).

One of the rabbit holes I looked at was this:

It does that by querying the querycachetwo table and joining it with the actor table for the user info and with the recent changes table for the activity ("edits") within the time period, with a count (see the getQueryInfo function).

Given all this, we could do something similar for the logged-in user on the PHP side and set a variable for the JS side to use when bucketing users. We would first need to look more closely at the performance implications and how they would affect the implementation. I imagine we'd want to cache this number somewhere to avoid running the query on every page hit by a logged-in user.
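To make the caching idea concrete, here is a rough sketch of a time-limited cache wrapped around an expensive per-user lookup. It is written in JavaScript only for illustration; in practice this would sit on the PHP side and use whatever cache MediaWiki provides. `fetchRecentEditCount` is a hypothetical stand-in for the recent-changes query, and the TTL is an assumption.

```javascript
// Hypothetical TTL cache around an expensive per-user lookup.
// `fetchRecentEditCount(userId)` stands in for the costly query;
// `now` is injectable so the expiry logic can be exercised in tests.
function createCachedCounter(fetchRecentEditCount, ttlMs, now = Date.now) {
  const cache = new Map(); // userId -> { value, expiresAt }

  return function getRecentEditCount(userId) {
    const currentTime = now();
    const entry = cache.get(userId);
    if (entry && entry.expiresAt > currentTime) {
      return entry.value; // still fresh: skip the query
    }
    const value = fetchRecentEditCount(userId);
    cache.set(userId, { value, expiresAt: currentTime + ttlMs });
    return value;
  };
}
```

With a cache like this, the cost is one query per user per TTL window rather than one per page hit. The trade-off is that the count can be stale by up to the TTL, which seems acceptable for deciding survey eligibility.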

There may also be other options I haven't seen yet. I'll keep poking at it.

Something else that could be useful: using something like getLatestEditTimestamp() *, we could also expose the last time a user did something and filter users based on that. It is likely less costly and could serve as a proxy for the active-user kind of metric.

  • Can we sample based on logged-in status?
    • ✅ Yes, configuring the audience like so:
      'audience' => [
          'anons' => false,
      ]
  • Can we sample based on edit count?
    • ✅ Yes, configuring the audience like so:
      'audience' => [
          'minEdits' => 0,
      ]
  • Can we sample based on a time frame for a number of edits?
    • Not easily
    • Investigation in progress... ⏳
    • Alternative criteria that could proxy this requirement:
      • getLatestEditTimestamp() *
        • We could expose the last time a user edited and filter users based on that and minEdits

I need to do a bit more digging into ActiveUsersPager and what it would take to expose the user-edits-in-last-active-days metric, but if I get lost down a rabbit hole I'll come back and reconvene to either make a new spike or add time to this one.

@Madalina I moved this to review because I've reached the timebox. You can see the responses up there.

I can continue investigating the number of edits in the last x months, but we probably want to make a new spike targeted at that.

Sounds good to me. I'll create a new spike ticket for the edits in the last x months issue.

I'll close this as we have answered the first two questions. The third question needs more investigation and we opened a new ticket for it: T292084