
Sample by title in QuickSurvey
Open, Normal, Public

Description

Add sampling by title options to QuickSurvey.

Why?

  • Sampling by page_id/title is important as we are interested in studying if different users have different trajectories on Wikipedia and whether the differences can be described depending on the users' socio-demographics.

What's already there?
Client-side page title information is available under mw.config.get( 'wgPageName' ), which could be compared to new survey configuration keys.
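
A minimal sketch of that comparison, assuming a new survey key named pageTitles (hypothetical; it is not part of the current configuration) and a helper name made up for illustration. In the browser the page name would come from mw.config.get( 'wgPageName' ); here it is passed in so the snippet runs on its own:

```javascript
// Hypothetical helper: decide whether a survey targets the current page.
// `survey.pageTitles` is an assumed configuration key, not an existing one.
function surveyTargetsPage( survey, pageName ) {
	// No title list configured: assume the survey applies to every page.
	if ( !Array.isArray( survey.pageTitles ) ) {
		return true;
	}
	// wgPageName uses underscores instead of spaces, so normalize the
	// configured titles the same way before comparing.
	return survey.pageTitles
		.map( function ( title ) {
			return title.replace( / /g, '_' );
		} )
		.indexOf( pageName ) !== -1;
}

// In the browser, the second argument would be mw.config.get( 'wgPageName' ).
const exampleSurvey = { name: 'example', pageTitles: [ 'Barack Obama', 'Dog' ] };
console.log( surveyTargetsPage( exampleSurvey, 'Barack_Obama' ) );
console.log( surveyTargetsPage( exampleSurvey, 'Apple' ) );
```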

Acceptance criteria

  • I can run a survey on Special:RecentChanges
  • I can run a survey on an arbitrary page.

Developer notes

Using page_title would be preferred, as special pages do not have page IDs and T146495 requests that specifically.

Event Timeline

leila triaged this task as High priority. Jan 14 2019, 12:44 PM
leila created this task.
leila added a subscriber: JKatzWMF.

@JKatzWMF before the holidays, we discussed looking into adding two features to QuickSurvey sampling choices. I've created this task. Could you help us direct this to folks who can help us estimate how much work this will be?

Thanks @leila I agree this would be very useful. I'd also like to piggyback on the momentum here, if I may, to raise the need for T186737 as well. If we're exploring the feasibility of QS improvements, I'd love to see that one considered as well.

Jdlrobson renamed this task from Sample by country and page_id in QuickSurvey to Sample by page_id in QuickSurvey. Jan 15 2019, 6:00 PM
Jdlrobson updated the task description.
Jdlrobson added a subscriber: Jdlrobson.

I've split out T213847 since we have separate tasks for all the other types of bucketing requests (which I've listed under the "Bucketing" column). The problem as always is getting developer time to flesh out QuickSurveys.

ovasileva moved this task from Upcoming to Tracking on the Readers-Web-Backlog board.
Jdlrobson renamed this task from Sample by page_id in QuickSurvey to Sample by title in QuickSurvey. Feb 6 2019, 6:27 PM
Jdlrobson updated the task description.

This is an example configuration currently supported: https://www.mediawiki.org/wiki/Extension:QuickSurveys#Configuration.

@leila, can you provide a sample config for how this should work given the current configuration design? Some questions that came up are:

  1. Will the sampling be different based on page title? For instance, perhaps the Barack Obama page is sampled at 75% on / off and the Dog page is sampled on / off at 50%?
  2. How does the title work across languages?
  3. Do we need to support multiple buckets with different sampling ratios or just A/B/C with even ratios (33%/33%/33%)?
Isaac added a comment. Feb 12 2019, 8:20 PM

Thanks @Niedzielski! Jumping in with some thoughts below:

  1. Will the sampling be different based on page title? For instance, perhaps the Barack Obama page is sampled at 75% on / off and the Dog page is sampled on / off at 50%?

Differential sampling by page would actually be great if it's not too difficult or greatly slows down the extension. Without differential sampling, if we want to include very popular pages alongside less popular, we would end up greatly over-sampling the readers of the more popular page. There are hacks around this (multiple surveys, each with coverage inverse to the page view counts of the articles they include), but they are messy.
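
As a rough illustration of the "coverage inverse to page views" workaround, here is a sketch that derives a per-page coverage from page view counts. The view counts, the target of 1000 impressions, and the function name are all made up for the example; nothing here reflects an existing QuickSurveys option:

```javascript
// Hypothetical daily page view counts; real values would come from page view data.
const examplePageViews = { Barack_Obama: 40000, Dog: 10000, Apple: 5000 };

// Give each page a coverage inversely proportional to its views so that every
// page contributes roughly the same expected number of survey impressions.
function inverseCoverage( pageViews, targetImpressions ) {
	const coverage = {};
	Object.keys( pageViews ).forEach( function ( title ) {
		// Coverage is a probability, so cap it at 1 for low-traffic pages.
		coverage[ title ] = Math.min( 1, targetImpressions / pageViews[ title ] );
	} );
	return coverage;
}

console.log( inverseCoverage( examplePageViews, 1000 ) );
```

With a single coverage value per survey, getting the same effect means one survey per coverage level, which is the messy part.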

  2. How does the title work across languages?

Could you expand a bit here about our options? We would prefer page ID as the identifier (no language issues, much easier for joining with other datasets in analysis) but I recognize that this does not work with Special Pages.

  3. Do we need to support multiple buckets with different sampling ratios or just A/B/C with even ratios (33%/33%/33%)?

If I understand correctly, we have coverage, which is essentially the split between "control" (no survey) and "test" (see a survey) users. And then you're asking, for those who are selected to see a survey, do we need to split them up in arbitrary ways? No, I think even ratios will be sufficient. Worst case, if we end up needing the flexibility, we can still direct different buckets to the same place (e.g., if A and B are the same, then functionally we have created a 66%-33% split).
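
To make the "direct different buckets to the same place" idea concrete, here is a small sketch. The bucket names, the destination labels, and both function names are hypothetical and only illustrate the arithmetic; this is not how QuickSurveys currently assigns buckets:

```javascript
// Even three-way split of surveyed users. `roll` is a number in [0, 1),
// for example Math.random() or a value derived from a session token.
function bucketFor( roll ) {
	const buckets = [ 'A', 'B', 'C' ];
	return buckets[ Math.floor( roll * buckets.length ) ];
}

// Hypothetical destinations: A and B point at the same survey variant,
// so roughly two thirds of surveyed users end up on variant-1.
const destinations = { A: 'variant-1', B: 'variant-1', C: 'variant-2' };

function variantFor( roll ) {
	return destinations[ bucketFor( roll ) ];
}

console.log( variantFor( Math.random() ) );
```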

ovasileva lowered the priority of this task from High to Normal. Feb 19 2019, 3:55 PM

o/ @Isaac

For my own reference, here's the current config in the QuickSurveys extension.json:

Configuration
"QuickSurveysConfig": [
	{
		"@name": "survey name",
		"name": "internal example survey",
		"@type": "internal or external link survey",
		"type": "internal",
		"@question": "survey question message key",
		"question": "ext-quicksurveys-example-internal-survey-question",
		"@description": "The message key of the description of the survey. Displayed immediately below the survey question.",
		"description": "ext-quicksurveys-example-internal-survey-description",
		"@answers": "possible answer message keys for positive, neutral, and negative",
		"answers": [
			"ext-quicksurveys-example-internal-survey-answer-positive",
			"ext-quicksurveys-example-internal-survey-answer-neutral",
			"ext-quicksurveys-example-internal-survey-answer-negative"
		],
		"@enabled": "whether the survey is enabled",
		"enabled": false,
		"@coverage": "percentage of users that will see the survey",
		"coverage": 0.5,
		"@platforms": "for each platform (desktop, mobile), which version of it is targeted (stable, beta)",
		"platforms": {
			"desktop": ["stable"],
			"mobile": ["stable", "beta"]
		}
	},
	{
		"name": "external example survey",
		"@type": "internal or external link survey",
		"type": "external",
		"@question": "survey question message key",
		"question": "ext-quicksurveys-example-external-survey-question",
		"@description": "the i18n key of the description of the survey",
		"description": "ext-quicksurveys-example-external-survey-description",
		"@link": "external link to the survey",
		"link": "ext-quicksurveys-example-external-survey-link",
		"@instanceTokenParameterName": "parameter to add to link",
		"instanceTokenParameterName": "parameterName",
		"@privacyPolicy": "The i18n key of the privacy policy text.",
		"privacyPolicy": "ext-quicksurveys-example-external-survey-privacy-policy",
		"@enabled": "whether the survey is enabled",
		"enabled": false,
		"@coverage": "percentage of users that will see the survey",
		"coverage": 0.5,
		"@platforms": "for each platform (desktop, mobile), which version of it is targeted (stable, beta)",
		"platforms": {
			"desktop": ["stable"],
			"mobile": ["stable", "beta"]
		}
	}
]

And documentation.


1
Differential sampling by page would actually be great if it's not too difficult or greatly slows down the extension. Without differential sampling, if we want to include very popular pages alongside less popular, we would end up greatly over-sampling the readers of the more popular page. There are hacks around this (multiple surveys, each with coverage inverse to the page view counts of the articles they include), but they are messy.

Sorry, I don't understand but an example would help me. Do you want something like this?

{
  "control": .5,
  "Barack_Obama": .1,
  "Apple": .25,
  "Dog": .15
}

Or something where the control group is per page:

{
  "Barack_Obama": .01, // control is .99 for this page
  "Apple": .55, // control is .45 for this page
  "Dog": .85 // control is .15 for this page
}

Or something where bucketing and sample rate is specified per page:

{
  "Barack_Obama": {"samplingRate": 1, "bucketingRate": .1}, // For this page, all users are bucketed at 10%, control is 90%.
  "Apple": {"samplingRate": .5, "bucketingRate": .25}, // For this page, half of users are bucketed at 25%, control is 75%.
  "Dog": {"samplingRate": 0, "bucketingRate": .95} // For this page, no users are bucketed at 95%, control is 5%.
}

Or something else? Please keep in mind the existing configuration format.


2
Could you expand a bit here about our options? We would prefer page ID as the identifier (no language issues, much easier for joining with other datasets in analysis) but I recognize that this does not work with Special Pages.

It is my understanding that page IDs will vary by wiki too and will need to be identified in advance of each survey. Does that make sense to you? For example, if you wanted to target the Barack Obama, Apple (fruit), and Dog articles on enwiki and equivalent articles on eswiki, you will need to identify the relevant article names per wiki and then query the page IDs (see example). These page IDs would then be specified in the proposed QuickSurveys config. Continuing the example, something like:

...
		"pageIDs": [534366, 18978754, 4269567],
...
		"pageIDs": [430434, 50309, 6081753],
...

It goes without saying that if you later wish to isolate a specific page across wikis for analysis, you would need to take care to pick the correct ID since a query probably wouldn't know how to order them consistently across wikis.

While better than page titles, page IDs still seem unfun and error-prone given all the other configuration necessary. I've only looked at a couple examples but consider this recent survey configuration and even this more consolidated configuration too.

I don't think the Wikidata Q item identifier varies though, and you could have a single configuration that worked across all wikis, assuming good Wikidata sitelinks. For example, Apple (fruit) is Q89. If the Wikibase extension is installed, I believe QuickSurveys could obtain the current page's identifier via wgWikibaseItemId and see if it is present in the configuration. The configuration might look like:

...
		"itemIDs": ["Q76", "Q89", "Q144"],
...

This probably won't work well for obscure pages as pages without Q codes or missing sitelinks cannot be considered. Still, unless you plan to intentionally test lots of unpopular pages, using the Q item ID seems like the best approach to me as you cannot guarantee a page will exist on all wikis even with page IDs or page titles.
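
A sketch of what the item-based check could look like, assuming the proposed "itemIDs" key and a made-up helper name. In the browser the item ID would be read from mw.config.get( 'wgWikibaseItemId' ); it is passed in here so the snippet is self-contained:

```javascript
// Hypothetical check against the proposed "itemIDs" configuration key.
function surveyTargetsItem( survey, itemId ) {
	// Pages without a Wikidata item (no Q code) can never match.
	if ( !itemId ) {
		return false;
	}
	return survey.itemIDs.indexOf( itemId ) !== -1;
}

const itemSurvey = { name: 'example', itemIDs: [ 'Q76', 'Q89', 'Q144' ] };
console.log( surveyTargetsItem( itemSurvey, 'Q89' ) );
console.log( surveyTargetsItem( itemSurvey, null ) );
```

The same configuration could then be shared across wikis, whereas a page ID list would have to be maintained separately for each wiki.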

Note also that QuickSurveys does not work on Special pages.

By the way, how many pages will be specified at once? 0-10? 100? ...? Configuration data appears to be sent with the page so bandwidth is a great concern. For example:

[
  {
    name: "Reader-trust-survey-en-v1",
    question: "Reader-trust-1-message",
    description: "Reader-trust-1-description",
    module: "ext.quicksurveys.survey.Reader-trust-survey-en-v1",
    coverage: 1,
    platforms: { desktop: ["stable"], mobile: ["stable"] },
    privacyPolicy: "Reader-trust-1-privacy",
    type: "external",
    link: "Reader-trust-1-link",
    instanceTokenParameterName: "token",
    isInsecure: false
  }
]
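
For a rough feel of how the payload grows with the number of pages, here is a sketch that serializes a config carrying a list of made-up titles under a hypothetical "pageTitles" key and measures it. The titles and the function name are invented, and the numbers ignore compression, so treat the output as an order-of-magnitude estimate only:

```javascript
// Build a survey config carrying `count` made-up page titles under a
// hypothetical "pageTitles" key, then measure its serialized size.
function configSizeInBytes( count ) {
	const titles = [];
	for ( let i = 0; i < count; i++ ) {
		titles.push( 'Example_page_title_' + i );
	}
	const config = [ {
		name: 'Reader-trust-survey-en-v1',
		coverage: 1,
		type: 'external',
		pageTitles: titles
	} ];
	// TextEncoder yields UTF-8 bytes, which is closer to what goes over the
	// wire (before compression) than the JavaScript string length.
	return new TextEncoder().encode( JSON.stringify( config ) ).length;
}

[ 10, 100, 1000 ].forEach( function ( count ) {
	console.log( count + ' titles: ' + configSizeInBytes( count ) + ' bytes' );
} );
```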

3
If I understand correctly, we have coverage, which is essentially the split between "control" (no survey) and "test" (see a survey) users. And then you're asking, for those who are selected to see a survey, do we need to split them up in arbitrary ways? No, I think even ratios will be sufficient. Worst case, if we end up needing the flexibility, we can still direct different buckets to the same place (e.g., if A and B are the same, then functionally we have created a 66%-33% split).

If I understand correctly, it is safe to assume that a given page is either unsampled, sampled and in the test bucket, or sampled and in the control bucket. I guess this question is related to the first question.

This probably won't work well for obscure pages as pages without Q codes or missing sitelinks cannot be considered. Still, unless you plan to intentionally test lots of unpopular pages, using the Q item ID seems like the best approach to me as you cannot guarantee a page will exist on all wikis even with page IDs or page titles.

I realized after posting this that I made assumptions about how pages will be chosen. Depending, it may be preferable to use page IDs over Q codes.

I've factored this request into the implementation of my patch in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/QuickSurveys/+/493130/ to allow it to be used in this way.

Change 493614 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/QuickSurveys@master] WIP: Allow the running of surveys on specific pages

https://gerrit.wikimedia.org/r/493614

Isaac added a comment. Mar 1 2019, 3:12 PM

Sorry for not getting back more quickly on this. Thanks both @Niedzielski and @Jdlrobson for working on this! Very exciting to see page title and edit count going through! I want to make sure I (and others who don't go through the code) understand the changes that are happening.

Page titles:

  • The page titles only apply to the main article space (so including page titles like Main_Page or Special:RecentChanges would not work at the moment)?
  • The coverage specified in the config is a single value that applies to all pages.

Edit Counts:

  • Right now, you can do the equivalent of sampling on users with edits less than some maximum (e.g., <5 edits), between a minimum and a maximum (e.g., 5-10 edits), and above a minimum (e.g., >10 edits), which uses the actual edit count as opposed to the pre-set buckets like "1-4 edits" reported in EventLogging (thanks for this!)
  • I don't see a way right now to exclude anonymous users through the edit count though (looking at the isInAudience function at https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/QuickSurveys/+/493130/5/resources/ext.quicksurveys.lib/lib.js@133). Anonymous users seem to be treated as having zero edits: mw.config.get( 'wgUserEditCount' ) returns null when not logged in, but null is treated as 0 in greater-than / less-than comparisons, and there is no check for null values of the editCount variable. Are there plans to have a separate check for logged-in vs. not-logged-in? Or am I missing some way of excluding them through the editCount variable?
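
To illustrate the null behaviour, and one possible explicit guard: the function name and the keys minEdits, maxEdits and includeAnons below are hypothetical and are not taken from the patch:

```javascript
// Relational operators convert null to 0, which is why anonymous users
// (edit count null) look like users with zero edits.
console.log( null < 5 );  // true
console.log( null >= 0 ); // true
console.log( null > 0 );  // false

// Hypothetical audience check with an explicit guard for anonymous users.
// `minEdits`, `maxEdits` and `includeAnons` are assumed config keys.
function editCountInAudience( audience, editCount ) {
	if ( editCount === null || editCount === undefined ) {
		// Anonymous: only included when the config asks for it explicitly.
		return audience.includeAnons === true;
	}
	if ( audience.minEdits !== undefined && editCount < audience.minEdits ) {
		return false;
	}
	if ( audience.maxEdits !== undefined && editCount > audience.maxEdits ) {
		return false;
	}
	return true;
}

// In the browser, the second argument would be mw.config.get( 'wgUserEditCount' ).
console.log( editCountInAudience( { maxEdits: 4 }, null ) );
console.log( editCountInAudience( { maxEdits: 4 }, 0 ) );
```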

Thanks again!!

Very exciting to see page title and edit count going through! I want to make sure I (and others who don't go through the code) understand the changes that are happening.

To be clear, our product owner has only asked us to implement "edit count". The patch here may need further support from your team to go through, but I wanted to capture how this might work.

The coverage specified in the config is a single value that applies to all pages.

The coverage is actually an array. I'd like to understand the surveys we want to run though and make sense of whether that's important.

The page titles only apply to the main article space (so including page titles like Main_Page or Special:RecentChanges would not work at the moment)?

My proposal covers all namespaces; however, this may need more thought if we want to target special pages across different languages. Right now, [ 'Special:RecentChanges' ] would not match https://fr.m.wikipedia.org/wiki/Sp%C3%A9cial:Modifications_r%C3%A9centes, for example. If you wanted to target Special:RecentChanges on both French and English, the config would look like [ 'Special:RecentChanges', 'Spécial:Modifications_récentes' ]. If we plan to run across an entire set of special pages across languages, this probably doesn't make sense. For me, this is the part that will make this task tricky.
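
A sketch of the matching problem, using made-up helper names and assuming the title list is a flat array as in the example above. It takes a /wiki/ URL path, percent-decodes it, and compares it against the configured titles:

```javascript
// Hypothetical list of targeted titles for one survey, as in the example above.
const targetedTitles = [ 'Special:RecentChanges', 'Spécial:Modifications_récentes' ];

// Extract the title from a /wiki/ URL path and percent-decode it.
// Both sides are normalized to NFC so accented characters compare equal
// regardless of how they were typed.
function titleFromPath( path ) {
	return decodeURIComponent( path.replace( /^\/wiki\//, '' ) ).normalize( 'NFC' );
}

function isTargeted( path ) {
	const title = titleFromPath( path );
	return targetedTitles.some( function ( candidate ) {
		return candidate.normalize( 'NFC' ) === title;
	} );
}

console.log( isTargeted( '/wiki/Sp%C3%A9cial:Modifications_r%C3%A9centes' ) );
console.log( isTargeted( '/wiki/Spezial:Letzte_%C3%84nderungen' ) );
```

Every additional language needs its own localized entry in the list, which is where this approach stops scaling.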

Since this task is about title, I'm going to reply to your other comments on the other ticket (T139317).

Isaac added a comment. Mar 1 2019, 4:15 PM

Ahh...slowly piecing together all the tasks :) Thanks for helping to link them @Jdlrobson

My proposal covers all namespaces; however, this may need more thought if we want to target special pages across different languages. Right now, [ 'Special:RecentChanges' ] would not match https://fr.m.wikipedia.org/wiki/Sp%C3%A9cial:Modifications_r%C3%A9centes, for example. If you wanted to target Special:RecentChanges on both French and English, the config would look like [ 'Special:RecentChanges', 'Spécial:Modifications_récentes' ]. If we plan to run across an entire set of special pages across languages, this probably doesn't make sense. For me, this is the part that will make this task tricky.

My understanding right now is that the config is project-specific. For instance, if you look at the subtasks under T131949, you'll see separate configs for each language (and then we went into each language edition and created the necessary MediaWiki: interface pages). And this frankly makes sense because we often have to translate the content and so each project really does need a separate set of pages and its own config. So I suspect if we ever got to a point where we wanted to survey a massive number of languages on the same pages, identifying the right page titles in each language would be much less effort than the other config/prep work that would need to happen.

Niedzielski removed Niedzielski as the assignee of this task. Mar 1 2019, 7:41 PM

Change 493614 abandoned by Jdlrobson:
WIP: Allow the running of surveys on specific pages

https://gerrit.wikimedia.org/r/493614

Bumping this to tracking. We have provided a platform to define audiences and a POC, but we don't feel confident in being able to add this without some concrete examples of surveys to run. In the meantime, any team or volunteer should feel empowered to expand QuickSurveys with that functionality if it is urgent and they are able to!

Isaac added a comment. Mar 5 2019, 9:10 PM

@jmatazzoni : see above based on your interest in T146495

Volker_E removed a subscriber: Volker_E. Mar 5 2019, 10:35 PM

I don't think the Wikidata Q item identifier varies

I think that, for most purposes, this is a safe assumption. However, it is possible to change which pages are linked to a given Wikidata Q item. For example, it's possible that https://de.wikipedia.org/wiki/Wissenschaft and https://en.wikipedia.org/wiki/Wissenschaft will both end up on the same Wikidata Q item, since they are actually about exactly the same subject. (The German Wikipedia article on "Wissenschaft" currently shares a Wikidata Q item with the English Wikipedia article on "Science".)

leila edited projects, added Research-Backlog; removed Research. Jul 11 2019, 4:05 PM