Page MenuHomePhabricator

Enable more nuanced targeting and sampling on Quicksurveys
Open, Needs TriagePublic

Description

Background

The Quicksurveys extension currently has limited targeting functionality. Highly active editors have expressed frustration with the survey tool excessively sampling them, which can result in lack of engagement on our surveys. More nuanced targeting and sampling by the tool would alleviate these frustrations.

User Story
TBD

Requirements (DRAFT)
Add the following funcitonality:

  • Ability to target finer edit count ranges:
    • within specified time periods
    • by edit or visit recency
    • by edit types, role (e.g. admin)
  • Ability to target specific users from a list of usernames
  • Ability to set specific sampling rates based on targeting criteria
  • Ability to stop the tool from serving surveys when specific quotas have been reached
  • Ability to trigger a survey right after an edit is saved.

Acceptance criteria
TBD

Event Timeline

Change #1029713 had a related patch set uploaded (by Mimurawil; author: Mimurawil):

[mediawiki/core@master] [WIP] T362646: Enable more nuanced targeting and sampling on QuickSurveys

https://gerrit.wikimedia.org/r/1029713

hi @Ladsgroup!

Sai Suman asked me to reach out to you when we had a first draft of the query that we would like to complete the Audience tuning requirements for QuickSurvey.

Context
We are working on enhancing the QuickSurvey extension and as part of this, we have been asked to provide admin the possibility to define surveys that are just shown to people that have a certain level of activity. In this case activity is defined as the edit count within a certain period. So for example they may want to send survey to people that have just started editing (so with a count of 5 per month), or to people that are very active (eg 25 per month).

Proposed solution
While discussing this with the team, we decided to see if it would be possible to enhance the current editUserCount to return more values (the timestamp of edits) and what the implications of this action would be on the database.

Proposed Implementation
We implemented a WIP PR that shows the query we were considering. The idea is to retrieve the user Count, grouped by "rev_timestamp." As far as we could see, there seems to already be an index on this table, so we hope it will be okay.

What we need from you
We need your expertise in this part of the system to discuss whether our proposal is good, whether you have other alternatives, how we would handle the different DBs (see commented-out code), and what caching strategies you would suggest (Saving it in a variable as it does for the count could not be the best as this returned query will be much larger).

I like it in Uk, but I am happy to jump on a call at any time if that suits you more!

NOTE: The request was to retrieve the information for a specific DATA range, but we thought it best NOT to pass this variable as it will make caching impossible (or more complicated).

Hi,
The general idea looks good to me, I have some notes about the implementation:

  • First, it's better to look how product needs it in more depth: For example, getting users that have made x edits in the last month or last 90 days is different technical implementation than users who made x edits in April 2024.
  • Second, it still might lead to a lot of reads

For example:

MariaDB [wikidatawiki_p]> select count(*) from revision where rev_actor = 931 and rev_timestamp like '202403%';
+----------+
| count(*) |
+----------+
|      844 |
+----------+
1 row in set (1 min 46.222 sec)

And my bot is not that active.

We need some mitigations:

    • Make sure bots are never queried.
    • Don't do range look ups, in our setup dates are stored as strings (BINARY(14)) so comparisons are slow
    • Maybe limit it to last one year or so, this is a product question: Do you need all years? I've been active for 18 years now, do you really need my edits divided by months for the past two decades? We can start with a year and then add more if more is needed.
  • Make sure you implement the caching, but that comes with caching invalidation, have fun with that.

Overall, I think we can find a working solution together, but I need to know more about the product requirements in here.

Thank you so much for your answer.

I spoke with Tanja and agreed that returning just the last 2 years of edits by month should be fine and not the full lifetime.

I just wanted to clarify that unfortunately, we will need to get information by data range and not a specific month. So, in our query, we are expecting to return a list of edits and then perform some normalization in PHP to group them by month.

From what you said above about the dates being a string, there is no simple way for us to create a query that automatically groups the edits by months, right? We should just return edits (e.g., '2023') and then do the grouping of the month in PHP.

Oh no, that would explode the database and appservers if you query for an active editor. you can just do a left() in the sql. Just query group by left(rev_timestamp, 6) and get the count. I know it won't work with PG but it shouldn't be too hard to fix :P

This works on my localhost:

MariaDB [client]> select left(rev_timestamp, 6), count(*) from revision where rev_actor = 1 group by left(rev_timestamp, 6);
+------------------------+----------+
| left(rev_timestamp, 6) | count(*) |
+------------------------+----------+
| 202403                 |        1 |
| 202404                 |        2 |
+------------------------+----------+
2 rows in set (0,001 sec)

fantastic thank you for your help, we will start to work on this and then provide more feedback!

Do you know if there is a simple way to "fill up" a local database with dataset? We want to make sure the code we run is as performant as possible, and we do not know how to achieve this with just the local database

Do you know if there is a simple way to "fill up" a local database with dataset? We want to make sure the code we run is as performant as possible, and we do not know how to achieve this with just the local database

Unfortunately not. You can use Special:Export to export a couple of pages in a wiki and import them in your local setup. But it's much more limited than what you get in production. We also have beta cluster that has a decent chunk of data and we can test stuff once they are merged (but not yet deployed to production) but both are rather small compared to what you get in production and its edge cases and issues.

Hi Amir, we did some of the work and we have a couple of quick questions for you regarding the above work:

  • We are able to create the query (User edit by months) by either casting the rev_timestamp into a number and comparing that (see code below-named buildIntegerCast) or we can just make a like comparison (see like comparison below). Do know which one performs better between the two?
  • We could not find a way to fo "left" using Wikimedia ORM and we used "SUBSTR" instead. Is that ok? or can you suggest a better method to use
  • When performing the LIKE query, we did not find an easy way to perform the LIKE IN and we actually used LIKE or LIKE instead. Is that ok? or can you suggest anything else?

Thank you as always for your support

BuildIntegerCast

->where( [
				'rev_actor' => $actorId,
				$dbr->expr(
					$dbr->buildIntegerCast( $dbr->buildSubString( 'rev_timestamp', 1, 6 ) ),
					'>=',
					$dbr->expr( $dbr->buildIntegerCast( $dbr->buildSubString( 'rev_timestamp', 1, 6 ) ), '-', 200 )
				)
			] )

Like comparison

WHERE
	rev_actor = 3
	AND (
		(SUBSTR(rev_timestamp, 1, 4) = '2022')
		OR (SUBSTR(rev_timestamp, 1, 4) = '2023')
		OR (SUBSTR(rev_timestamp, 1, 4) = '2024'))

Change #1032553 had a related patch set uploaded (by Amdrel; author: Amdrel):

[mediawiki/core@master] [WIP] Added the wgUserEditCountByMonth config key

https://gerrit.wikimedia.org/r/1032553

Amdrel changed the task status from Open to In Progress.May 16 2024, 7:21 PM

[@SimoneThisDot: You may want to explicitly ping folks so they can filter their notifications, and use their usernames as other names can be ambiguous. Thanks!]

@Ladsgroup Did you see the above message? Sorry, I forgot to reference you!

Also, I have replied to your question about lazyloading the data. Is there any chance you can help provide some context on how you would suggest implementing it? (or link it to existing extension that do so?)

thank you!

Hi, I responded in the patch. hope that'd be useful for you.

We could not find a way to fo "left" using Wikimedia ORM and we used "SUBSTR" instead. Is that ok? or can you suggest a better method to use

left takes advantage of indexes, substr probably wouldn't (I couldn't find any docs on it, probably need to look at the source code).

Regarding the query, you can probably just do multiple ones in the past couple of years. That should be fine.

Change #1032553 abandoned by Amdrel:

[mediawiki/core@master] [WIP] Added the wgUserEditCountByMonth config key

Reason:

We ended up going with a different approach that doesn't require us to make any changes to core.

https://gerrit.wikimedia.org/r/1032553

SimoneThisDot changed the task status from In Progress to Open.Jun 3 2024, 6:39 AM
SimoneThisDot changed the status of subtask T364954: Create "EditCountByMonth" method from In Progress to Open.
SimoneThisDot changed the status of subtask T364955: Enhance QuickSurvey targeting from Open to In Progress.

Change #1029713 abandoned by Mimurawil:

[mediawiki/core@master] [WIP] T362646: Enable more nuanced targeting and sampling on QuickSurveys

Reason:

Patch demo not needed anymore

https://gerrit.wikimedia.org/r/1029713