Investigation of QS sampling capabilities [16h]
Closed, Resolved · Public · Spike

Assigned To

Authored By

	Madalina
	Sep 21 2021, 3:48 PM

Description

This task covers an analysis of how QuickSurveys sampling works.

User Story
As a data analyst I want to be able to send the survey to a sample of users so that I can get statistically viable results.

What we know
We’re looking to show the survey to a random sample of logged-in users who have made X edits in the last Y months at the time of sampling.

QuickSurveys first samples all users by a percentage, then checks logged-in status and the other audience criteria to decide whether each sampled user sees the survey. For a given survey with a sampling rate of 0.4, 40% of all users are bucketed for the survey; of that 40%, only those who match the other criteria, such as being logged in, have the survey shown to them.

Sampling is based on session tokens, which are semi-permanent, so we get a percentage of device sessions rather than a percentage of pageviews.
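The two-stage behaviour described above (bucket by session token first, then apply the audience criteria) can be sketched roughly as follows. This is an illustration only, not the QuickSurveys implementation: the function names, the `samplingRate` field, the shape of the `user` object, and the way a token is mapped to a number are all assumptions made for the example.

```javascript
// Hypothetical sketch; not taken from the QuickSurveys code base.

// Map a session token to a stable number in [0, 1).
// Assumption: the token starts with at least 8 hexadecimal characters.
function tokenToUnitInterval(token) {
  const prefix = token.slice(0, 8);
  return parseInt(prefix, 16) / 0x100000000;
}

// Stage 1: a given token always gets the same answer for a given rate,
// so the sample is a share of sessions rather than of pageviews.
function isInSample(token, samplingRate) {
  return tokenToUnitInterval(token) < samplingRate;
}

// Stage 2: audience criteria, checked only for users already in the sample.
function matchesAudience(user, audience) {
  if (audience.anons === false && user.isAnon) return false;
  if (audience.minEdits !== undefined && user.editCount < audience.minEdits) return false;
  if (audience.maxEdits !== undefined && user.editCount > audience.maxEdits) return false;
  return true;
}

function shouldShowSurvey(token, user, survey) {
  return isInSample(token, survey.samplingRate) && matchesAudience(user, survey.audience);
}
```

Because the audience check runs after bucketing, the share of logged-in users who actually see the survey is at most the sampling rate, and lower if further criteria such as `minEdits` exclude some of them.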

Open Questions

Can we sample based on logged-in status?
- ✅ Yes, configuring the audience like so:

	'audience' => [
		'anons' => false,
	]

Can we sample based on edit count?
- ✅ Yes, configuring the audience like so:

	'audience' => [
		'minEdits' => 0,
	]

Can we sample based on a time frame for a number of edits?
- Not easily
- Investigation in progress... ⏳
- Alternative criteria that could proxy this requirement:
  - getLatestEditTimestamp() *
    - We could expose the last time a user edited and filter users based on that and minEdits

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Madalina	T293972 [EPIC] Research QS and validate as distribution tool for the safety survey
		Resolved	Spike	• Jhernandez	T291500 Investigation of QS sampling capabilities [16h]

Event Timeline

Madalina created this task. Sep 21 2021, 3:48 PM

Restricted Application added a subscriber: Aklapper. Sep 21 2021, 3:48 PM

Madalina renamed this task from Investigation of OS sampling capabilities to Investigation of QS sampling capabilities. Sep 21 2021, 3:48 PM

• Jhernandez moved this task from Needs triage to Kanban on the Trust and Safety Tools Team Backlog board. Sep 22 2021, 2:37 PM

• Jhernandez edited projects, added Trust and Safety Tools Team Backlog (Kanban); removed Trust and Safety Tools Team Backlog.

Madalina renamed this task from Investigation of QS sampling capabilities to Investigation of QS sampling capabilities [16h]. Sep 22 2021, 2:58 PM

Madalina updated the task description. Sep 22 2021, 4:56 PM

@Jhernandez My understanding of the answer to these questions as of now is:

- Can we sample based on logged-in status? yes (can set anon to false)
- Can we sample based on edit count? yes (can set minEdits and maxEdits)
- Can we sample based on a time frame for a number of edits? no

Does that look right to you?

I'd imagine that would mean the next step for this task would be investigating how hard it would be to sample based on recent edits within a time frame as opposed to all time edits.

In T291500#7372856, @mepps wrote:

@Jhernandez My understanding of the answer to these questions as of now is:

Can we sample based on logged-in status? yes (can set anon to false)

Can we sample based on edit count? yes (can set minEdits and maxEdits)

Can we sample based on a time frame for a number of edits? no

Does that look right to you?

All sound right to me 👍

I'd imagine that would mean the next step for this task would be investigating how hard it would be to sample based on recent edits within a time frame as opposed to all time edits.

I went investigating yesterday, and besides the currently used wgUserEditCount in mw.config I couldn't find anything else readily available related to what they call "active users" (users who made X edits in the last Y months).

One of the rabbit holes I looked at was this:

- Manual:Interface/JavaScript#mw.config and the available variables
- $wgActiveUserDays and related uses, which led me to...
- class SpecialActiveUsers, which hosts Special:ActiveUsers, a page that lists users and "recent" edits, which internally uses...
- class ActiveUsersPager, which is the class that actually gets the data

The way it does that is by querying the table querycachetwo and joining the user data with the actor table for the user info and the recent changes table for the activity ("edits") within the time period with a count (see function getQueryInfo).

Given all this, we could do something similar for the logged-in user on the PHP side and set a variable for the JS side to bucket users, but we would need to investigate the performance implications more specifically and how they would affect the implementation. I imagine we'd want to cache this number somewhere to avoid running this query on every page hit by a logged-in user.
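To make the caching idea concrete, here is a minimal cache-aside sketch with a time-to-live. It is written in JavaScript purely for illustration; in practice this would live on the PHP side and use whatever caching layer is appropriate there. `createRecentEditCountCache`, `countRecentEdits`, and `ttlMs` are hypothetical names, and `countRecentEdits` stands in for the expensive query.

```javascript
// Hypothetical sketch of the cache-aside pattern; all names are assumptions.
// countRecentEdits(userId) stands in for the expensive query.
// now is injectable so that expiry can be exercised with a fake clock.
function createRecentEditCountCache(countRecentEdits, ttlMs, now = () => Date.now()) {
  const entries = new Map(); // userId -> { value, expiresAt }

  return function getRecentEditCount(userId) {
    const cached = entries.get(userId);
    if (cached !== undefined && cached.expiresAt > now()) {
      return cached.value; // still fresh: skip the query
    }
    const value = countRecentEdits(userId); // miss or expired: run the query once
    entries.set(userId, { value, expiresAt: now() + ttlMs });
    return value;
  };
}
```

With this shape the query runs at most once per user per TTL window, at the cost of the count being stale by up to `ttlMs`.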

There may also be other options I haven't seen yet. I'll keep poking at it.

• Jhernandez updated the task description. Sep 23 2021, 11:20 AM

Something else that could be useful: using something like getLatestEditTimestamp(), we could also expose the last time a user did something and filter users based on that. It is likely less costly and could serve as a proxy for the active-user kind of metric.
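A rough sketch of how such a proxy check could look on the client side is below. It assumes the server exposes the last edit time as an ISO 8601 string (`lastEditTimestamp`) alongside the total edit count; that field, the function name, and the `maxAgeDays` parameter are hypothetical and not existing QuickSurveys options.

```javascript
// Hypothetical proxy for "active editor": enough total edits AND a recent last edit.
// Assumption: user.lastEditTimestamp is an ISO 8601 string, or null if the user never edited.
function isLikelyActiveEditor(user, minEdits, maxAgeDays, now = new Date()) {
  if (user.lastEditTimestamp === null) return false;
  const lastEdit = new Date(user.lastEditTimestamp);
  const ageMs = now.getTime() - lastEdit.getTime();
  const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
  return user.editCount >= minEdits && ageMs <= maxAgeMs;
}
```

This is only a proxy: a user with many old edits and a single recent one would pass, even though they did not make X edits within the time frame.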

Madalina added subscribers: • eigyan, • ERayfield, ARamirez_WMF. Sep 23 2021, 2:25 PM

• Jhernandez claimed this task. Sep 24 2021, 12:33 PM

• Jhernandez moved this task from 🎬 Ready to 💻 In Progress on the Trust and Safety Tools Team Backlog (Kanban) board.

Can we sample based on logged-in status?
- ✅ Yes, configuring the audience like so:

	'audience' => [
		'anons' => false,
	]

Can we sample based on edit count?
- ✅ Yes, configuring the audience like so:

	'audience' => [
		'minEdits' => 0,
	]

Can we sample based on a time frame for a number of edits?
- Not easily
- Investigation in progress... ⏳
- Alternative criteria that could proxy this requirement:
  - getLatestEditTimestamp() *
    - We could expose the last time a user edited and filter users based on that and minEdits

I need to do a bit more digging into ActiveUsersPager and what it would take to expose the user-edits-in-last-active-days metric, but if I get lost in a rabbit hole I'll come back and reconvene to either make a new spike or add time to this one.

• Jhernandez moved this task from 💻 In Progress to 🔍 Review/Feedback on the Trust and Safety Tools Team Backlog (Kanban) board. Sep 27 2021, 3:36 PM

@Madalina I moved this to review because I've reached the timebox. You can see the responses up there.

I can continue investigating the number of edits in the last x months, but we probably want to make a new spike targeted at that.

Sounds good to me. I'll create a new spike ticket for the edits in the last x months issue.

• Jhernandez moved this task from 🔍 Review/Feedback to 🤘 Done on the Trust and Safety Tools Team Backlog (Kanban) board. Sep 29 2021, 9:49 AM

Madalina mentioned this in T292084: Investigation of QS sampling based on time frame capabilities. Sep 29 2021, 3:54 PM

I'll close this as we have answered the first two questions. The third question needs more investigation and we opened a new ticket for it: T292084

Madalina added a parent task: T293972: [EPIC] Research QS and validate as distribution tool for the safety survey. Oct 21 2021, 8:28 AM

Madalina added a project: WMF-Safety-Survey. Oct 25 2021, 4:56 PM
