Demographic Surveys Configurations
Closed, Resolved (Public)

Description

Overview

This task encompasses the configuration for each demographics survey and the creation of the associated interface pages. Each configuration and its interface page links will be added as a comment below as they are finalized. See T212444 for tracking of links to the configs for each language.

Coverage Calculations

Based on 2017 survey sampling rates:

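| Wikipedia Language | June 2017 PVs (in millions) | 2017 Sampling Rate (1:X) | 2017 Response count | May 2019 PVs (in millions) | delta May->June (2018) | 2019 Desired # Responses | 2019 Sampling Rate (1:X) | Actual # Responses |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| ar -- Arabic | 147 | 10 | 2158 | 178 | 0.93 | 5000 | 4 | 10595 |
| be -- Bengali | 8.1 | 1 | 1198 | | | | | |
| de -- German | 901 | 5 | 28000 | 974 | 0.96 | 5000 | 29 | 5118 |
| en -- English (world) | 7125 | 40 | 24140 | 7778 | 0.93 | 10000 | 98 | ~8000 |
| en -- English (Africa) | | | | 165 | 0.94 | 5000 | 2 | ~10500 |
| es -- Spanish | 1071 | 5 | 39021 | 1164 | 0.86 | 10000 | 15 | 17733 |
| fa -- Persian | | | | 147 | 0.97 | 1500 | 9 | 9214 |
| fr -- French (world) | | | | 734 | 0.89 | 5000 | 12 | ~5900 |
| fr -- French (Africa) | | | | 61 | 0.91 | 2000 | 2 | ~4200 |
| he -- Hebrew | 46.6 | 3 | 8848 | 62 | 0.92 | 1500 | 21 | 992 |
| hi -- Hindi | 30.5 | 2 | 3064 | | | | | |
| hu -- Hungarian | 38.2 | 2.5 | 2455 | 55 | 0.93 | 1500 | 5 | 1725 |
| ja -- Japanese | 1056 | 5 | 19996 | | | | | |
| nl -- Dutch | 145 | 8 | 3277 | | | | | |
| no -- Norwegian | | | | 36 | 0.94 | 1500 | 2 | 998 |
| pt -- Portuguese | | | | 361 | 1.01 | 5000 | 6 | |
| ro -- Romanian | 26.7 | 2 | 3829 | 44 | 0.83 | 1500 | 5 | 1873 |
| ru -- Russian | 843 | 5 | 67621 | 838 | 0.89 | 5000 | 59 | 6204 |
| uk -- Ukrainian | 37.8 | 2.5 | 8041 | 67 | 0.71 | 1500 | 12 | 1579 |
| zh -- Chinese | 367 | 20 | 5957 | 410 | 0.99 | 5000 | 26 | 3259 |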

Notes

  • To correct for seasonal changes in page views between May (the latest month for which page view numbers are available) and June (when the survey will run), the ratio of June 2018 page views to May 2018 page views is included as the delta May -> June (2018) column. See T226273#5276481 for more details.
  • Additionally, the sampling for Ukrainian, Romanian, and Spanish is rounded up further to account for especially low July numbers (likely to affect these surveys, as they run at the end of June).
  • Sampling rates for surveys that were not run in 2017 are based on Japanese as a very conservative case for the number of expected responses per page view: an order of magnitude lower than languages like Hebrew and Ukrainian, though only about half that of most languages (English, German, Spanish).
  • The English total for both surveys was 18488. This can't be split exactly into the worldwide and Africa surveys because it is not clear which survey readers under 18 took, but of the over-18 responses, ~57% (~10500) came from the Africa survey and ~43% (~8000) from the worldwide survey.
  • The same applies to French: of 10161 total responses, ~58% (~5900) were from the worldwide survey and ~42% (~4200) from the Africa survey.
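The sampling rates in the table appear consistent with the calculation sketched below. This is a reconstruction, not a formula stated in the task: the function names are hypothetical, the inputs are the values as read from the table, and it assumes the response rate per page view and the survey duration are comparable between 2017 and 2019. The result seems to be rounded down to get the 1:X rate, except where the notes say it was adjusted further.

```python
import math


def responses_per_million_pvs(responses_2017, rate_2017, pvs_2017_millions):
    """Responses per million page views if every page view had been sampled in 2017."""
    return responses_2017 * rate_2017 / pvs_2017_millions


def sampling_rate_2019(resp_per_million, pvs_may_2019_millions, delta_may_june, desired):
    """Return X for a 1:X sampling rate expected to yield `desired` responses."""
    expected_june_pvs = pvs_may_2019_millions * delta_may_june  # seasonal correction
    responses_at_full_sampling = resp_per_million * expected_june_pvs
    return responses_at_full_sampling / desired


# German: 901M PVs, 1:5 sampling, 28000 responses in 2017; 974M PVs in May 2019, delta 0.96.
german = sampling_rate_2019(responses_per_million_pvs(28000, 5, 901), 974, 0.96, 5000)

# French (world) was not surveyed in 2017, so the Japanese 2017 figures serve as the baseline.
japanese_baseline = responses_per_million_pvs(19996, 5, 1056)
french_world = sampling_rate_2019(japanese_baseline, 734, 0.89, 5000)

print(math.floor(german), math.floor(french_world))  # 29 12
```

Both values match the 2019 Sampling Rate column for those rows (1:29 and 1:12).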

The desired numbers of responses fall into a few categories:

  • at a minimum, we aim for 1500 responses in order to provide robust demographic results and some stratifications (e.g., age vs. gender). With 1500 responses, we will be able to debias but will likely not have enough responses to do subgroup discovery or other, more nuanced analyses. We do this for languages that are largely concentrated in a single country and are not high page-view languages (fa, he, hu, no, ro, uk).
  • for languages with high page views and many countries (en, fr, es), we aim for 10,000 responses.
  • for the surveys specific to Africa, we aim for as many responses as we can get while not sampling everyone.
  • for all other languages, we aim for 2000-5000 responses as a balance between enough responses to do country-specific stratifications and not oversampling.

Event Timeline

@Isaac, I believe the approach you took to calculating the sampling rate is misleading and you might end up with a low number of responses, at least for Romanian. Since you extrapolate directly from 2017 data, you seem to assume that the differences in pageviews between May 2019 and June 2017 are all due to the normal evolution of the wiki, with no seasonal variation. However, looking at the available pageview data for ro.wp ( https://stats.wikimedia.org/v2/#/ro.wikipedia.org/reading/total-page-views/normal|table|all|~total|monthly ), one can see a decrease in PVs each June on the order of 15-20%, followed by another decrease in July of more than 20% compared to June. These are caused by school recess, people taking vacations, or simply going out more, and will likely affect your potential audience.

A better algorithm would be the following:

  1. Calculate the increase in pageviews between May 2019 and May 2017 (or throughout the first 5 months for even better accuracy)
  2. Apply this increase to July 2017 data to extrapolate July 2019 PV data.
  3. Calculate the sampling rate using July 2019 PV data.

For ro.wp, that would give you:
May 19 / May 17: 44438741 ÷ 33892589 = 1.311163954
July 19 extrapolated PV: 1.311163954 × 22751557 = 29831021.432730825
Needed sample rate: 29831021.432730825 / (26700000 / 2 / 3829 × 1500) = 5.704 (rounded: 5, or worst case 6)
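The three steps above can be sketched in a few lines. The function and parameter names are hypothetical; the numbers are the ro.wp figures quoted in this comment, and the sketch assumes (as the calculation does) that page views per response are unchanged since the 2017 survey.

```python
def july_adjusted_sampling_rate(may_2019_pvs, may_2017_pvs, july_2017_pvs,
                                survey_2017_pvs, rate_2017, responses_2017,
                                desired_responses):
    """Return X for a 1:X sampling rate, extrapolating July 2019 page views."""
    # Step 1: growth in page views between May 2017 and May 2019.
    growth = may_2019_pvs / may_2017_pvs
    # Step 2: apply that growth to July 2017 to estimate July 2019.
    july_2019_pvs = growth * july_2017_pvs
    # Step 3: page views that had to be sampled per response in 2017,
    # scaled to the desired number of responses.
    sampled_pvs_per_response = survey_2017_pvs / rate_2017 / responses_2017
    return july_2019_pvs / (sampled_pvs_per_response * desired_responses)


rate = july_adjusted_sampling_rate(
    may_2019_pvs=44438741, may_2017_pvs=33892589, july_2017_pvs=22751557,
    survey_2017_pvs=26700000, rate_2017=2, responses_2017=3829,
    desired_responses=1500,
)
print(round(rate, 3))  # 5.704
```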

@Strainu thanks for pointing that out! Indeed, I was assuming in the calculations that the seasonality would not be that intense, and that doesn't seem to hold up for certain wikis. See updated numbers in the description. I made a simple May -> June correction based on 2018, as that seems to capture most of it and this is an inexact process.

I think you should use July, not June data, if you plan on running the survey in July :)

> I think you should use July, not June data, if you plan on running the survey in July :)

Yeah, the survey will hopefully be complete by July but it's also a fair point that for Romanian, the end of June looks a lot more like July than it looks like the rest of June. I'll bump the sampling up for Romanian, Ukrainian, and Spanish (similar drops) to account for it. For a couple (Chinese, Arabic), July actually trends back up, which is why I did not just fully switch to July numbers.

English (worldwide)

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-en | 7 days | 10000 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

enwiki => [
    'enabled' => true,
    "name" => "reader-demographics-en",
    "type" => "external",
    "description" => "Reader-demographics-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.01, // 1 out of 100
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]
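The "coverage" value in the configurations in this task is the reciprocal of the 1:X sampling rate from the description, rounded to two or three decimals. A small sketch of that conversion (the helper name is hypothetical) makes it easy to cross-check the inline comments such as "1 out of 100":

```python
def coverage_from_rate(one_in_x, digits=3):
    """Convert a 1:X sampling rate into a coverage fraction, rounded to `digits` decimals."""
    return round(1 / one_in_x, digits)


for x in (100, 29, 12, 2):
    print(f"1 out of {x} -> coverage {coverage_from_rate(x)}")
```

Some configurations round to two decimals instead (for example 0.07 for 1 out of 15), which `coverage_from_rate(15, digits=2)` reproduces.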

English (Africa)

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-en-af | 7 days | 5000 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

enwiki => [
    'enabled' => true,
    "name" => "reader-demographics-en-af",
    "type" => "external",
    "description" => "Reader-demographics-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.5, // 1 out of 2
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop" => ["stable"],
        "mobile" => ["stable"]
    ],
    // Only show this survey to readers in the listed African countries
    "audience" => [
        "countries" => ["AO", "BF", "BI", "BJ", "BW", "CD", "CF", "CG", "CI", "CM", "CV", "DJ", "DZ", "EG", "EH", "ER", "ET", "GA", "GH", "GM", "GN", "GQ", "GW", "KE", "KM", "LR", "LS", "LY", "MA", "MG", "ML", "MR", "MU", "MW", "MZ", "NA", "NE", "NG", "RE", "RW", "SC", "SD", "SH", "SL", "SN", "SO", "SS", "ST", "SZ", "TD", "TG", "TN", "TZ", "UG", "YT", "ZA", "ZM", "ZW"]
    ],
]
NOTE: this configuration uses the same pages as English (worldwide) T226273#5279221 but has a higher sampling rate and a list of countries associated with it. Both should be deployed.

Arabic

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-ar | 7 days | 5000 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

Configuration

arwiki => [
    'enabled' => true,
    "name" => "reader-demographics-ar",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.25, // 1 out of 4
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

German

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-de | 7 days | 5000 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

dewiki => [
    'enabled' => true,
    "name" => "reader-demographics-de",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.034, // 1 out of 29
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

French (worldwide)

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-fr | 7 days | 5000 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

frwiki => [
    'enabled' => true,
    "name" => "reader-demographics-fr",
    "type" => "external",
    "description" => "Reader-demographics-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.083, // 1 out of 12
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

French (Africa)

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-fr-af | 7 days | 2000 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

frwiki => [
    'enabled' => true,
    "name" => "reader-demographics-fr-af",
    "type" => "external",
    "description" => "Reader-demographics-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.5, // 1 out of 2
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop" => ["stable"],
        "mobile" => ["stable"]
    ],
    // Only show this survey to readers in the listed African countries
    "audience" => [
        "countries" => ["AO", "BF", "BI", "BJ", "BW", "CD", "CF", "CG", "CI", "CM", "CV", "DJ", "DZ", "EG", "EH", "ER", "ET", "GA", "GH", "GM", "GN", "GQ", "GW", "KE", "KM", "LR", "LS", "LY", "MA", "MG", "ML", "MR", "MU", "MW", "MZ", "NA", "NE", "NG", "RE", "RW", "SC", "SD", "SH", "SL", "SN", "SO", "SS", "ST", "SZ", "TD", "TG", "TN", "TZ", "UG", "YT", "ZA", "ZM", "ZW"]
    ],
]
NOTE: this configuration uses the same pages as French (worldwide) T226273#5279624 but has a higher sampling rate and a list of countries associated with it. Both should be deployed.

Hebrew

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-he | 7 days | 1500 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

hewiki => [
    'enabled' => true,
    "name" => "reader-demographics-he",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.05, // 1 out of 20
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

Hungarian

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-hu | 7 days | 1500 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

huwiki => [
    'enabled' => true,
    "name" => "reader-demographics-hu",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.2, // 1 out of 5
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

Norwegian (Bokmål)

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-no | 7 days | 1500 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

nowiki => [
    'enabled' => true,
    "name" => "reader-demographics-no",
    "type" => "external",
    "description" => "Reader-demographics-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.5, // 1 out of 2
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

Romanian

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-ro | 7 days | 1500 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

rowiki => [
    'enabled' => true,
    "name" => "reader-demographics-ro",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.2, // 1 out of 5
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

Russian

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-ru | 7 days | 5000 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

ruwiki => [
    'enabled' => true,
    "name" => "reader-demographics-ru",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.02, // 1 out of 50
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

Ukrainian

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-uk | 7 days | 1500 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

ukwiki => [
    'enabled' => true,
    "name" => "reader-demographics-uk",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.08, // 1 out of 12
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

Chinese

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-zh | 7 days | 5000 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

zhwiki => [
    'enabled' => true,
    "name" => "reader-demographics-zh",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.04, // 1 out of 25
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

Persian

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-fa | 7 days | 1500 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

fawiki => [
    'enabled' => true,
    "name" => "reader-demographics-fa",
    "type" => "external",
    "description" => "Reader-demographics-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.11, // 1 out of 9
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

Spanish

| Survey | Duration | Goal | Start Date and time | End Date |
| ----- | ----- | ----- | ----- | ----- |
| reader-demographics-es | 7 days | 10000 responses | 2019/6/26 @ 1100 UTC (0700 EST) | 2019/7/1 @ 1100 UTC |

Interface pages

eswiki => [
    'enabled' => true,
    "name" => "reader-demographics-es",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.07, // 1 out of 15
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop"=> ["stable"],
        "mobile"=> ["stable"]
    ],
]

@bmansurov All of the configurations that are launching tomorrow have been uploaded as comments on this task. You can see an overview with each language and the link to the configuration in the description of this task as well: T212444

Currently we have surveys for: arwiki, dewiki, enwiki (2), eswiki, fawiki, frwiki (2), hewiki, huwiki, nowiki, rowiki, ruwiki, ukwiki, zhwiki

Portuguese Wikipedia is not complete at this point so you can skip it for now. If it is completed at some point today, I will update this.

Change 519167 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Enable reader demographics surveys

https://gerrit.wikimedia.org/r/519167

@bmansurov copying over what I said in IRC just in case:

Thanks! I quickly scanned the patch and looks good to me. I'll be awake ahead of deployment so I'm around in case there are questions and to help with testing as needed.

@Isaac do you want to divide up the wikis for testing? We'll have a little time to test before deploying everywhere.

I'll take arwiki, dewiki, enwiki, eswiki, fawiki, frwiki.

Will you take hewiki, huwiki, nowiki, rowiki, ruwiki, ukwiki, zhwiki?

Can you also join #wikimedia-operations during deployment?

Change 519167 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable reader demographics surveys

https://gerrit.wikimedia.org/r/519167

Mentioned in SAL (#wikimedia-operations) [2019-06-26T11:17:42Z] <urbanecm@deploy1001> sync-file aborted: Reverting [[:gerrit:519167]] (T226273) (duration: 00m 32s)

Change 519215 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] QuickSurveys: rename some surveys

https://gerrit.wikimedia.org/r/519215

Change 519215 merged by jenkins-bot:
[operations/mediawiki-config@master] QuickSurveys: rename some surveys

https://gerrit.wikimedia.org/r/519215

Mentioned in SAL (#wikimedia-operations) [2019-06-26T11:40:08Z] <dcausse@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T226273: Enable reader demographics surveys (duration: 00m 55s)

Change 519216 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Undeploy reader demographics surveys

https://gerrit.wikimedia.org/r/519216

Notes regarding deployment:

  • It was determined that two surveys cannot have the same name (as was the case for both the English and French surveys). As a fix, we appended "-af" to the names of the surveys targeted at African countries. Though this is nice for statistical purposes, we had avoided it because it means that readers in African countries might see the survey again after answering. Dismissing or answering it the second time will completely dismiss it, though, and this is no different from a reader who switches browsers or clears their cache (who could also see the survey multiple times even after answering).
  • For the English survey, I realized during deployment that the survey link pointed to a slightly older version of the survey used in the pilot testing. I corrected the link, and the survey was going to the correct place after approximately 20 minutes.
  • For the Norwegian and Russian surveys, survey logic had accidentally been turned on that required a number as the answer to the survey-ID question (which is prefilled with a string of characters and numbers). This would have caused some frustration and potentially prevented readers from responding. It affected approximately 20 surveys for Russian and a smaller number for Norwegian but was quickly fixed.
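The name collision from the first note is the kind of problem a pre-deployment check could catch. The sketch below is only an illustration: the helper name and the simplified survey dictionaries are hypothetical, and it assumes the worldwide and Africa surveys originally shared the names "reader-demographics-en" and "reader-demographics-fr".

```python
from collections import Counter


def duplicate_survey_names(surveys):
    """Return the survey names that occur more than once, sorted."""
    counts = Counter(survey["name"] for survey in surveys)
    return sorted(name for name, count in counts.items() if count > 1)


# Assumed original naming: each worldwide/Africa pair shared one name.
original = [
    {"name": "reader-demographics-en", "coverage": 0.01},
    {"name": "reader-demographics-en", "coverage": 0.5},
    {"name": "reader-demographics-fr", "coverage": 0.083},
    {"name": "reader-demographics-fr", "coverage": 0.5},
]
# After the fix: "-af" appended to the Africa-targeted surveys.
renamed = [
    {"name": "reader-demographics-en", "coverage": 0.01},
    {"name": "reader-demographics-en-af", "coverage": 0.5},
    {"name": "reader-demographics-fr", "coverage": 0.083},
    {"name": "reader-demographics-fr-af", "coverage": 0.5},
]

print(duplicate_survey_names(original))  # ['reader-demographics-en', 'reader-demographics-fr']
print(duplicate_survey_names(renamed))   # []
```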

> It was determined that two surveys cannot have the same name (as was the case for both the English and French surveys). As a fix, we appended "-af" to the names of the surveys targeted at African countries. Though this is nice for statistical purposes, we had avoided it because it means that readers in African countries might see the survey again after answering. Dismissing or answering it the second time will completely dismiss it, though, and this is no different from a reader who switches browsers or clears their cache (who could also see the survey multiple times even after answering).

Maybe it's time to introduce a "NOT" selector to audiences, which we could have used to exclude the first enwiki survey from the African countries.

We should also introduce continents, because finding and adding all countries in a continent may be error-prone and time-consuming.

> Maybe it's time to introduce a "NOT" selector to audiences, which we could have used to exclude the first enwiki survey from the African countries.

Yeah, I was thinking about that but was not certain whether our use-case (survey the world but upsample a certain region) is representative of how this functionality will be generally used. But also it could be useful for single surveys -- for instance, excluding France from French Wikipedia sampling already allows for much higher sampling of Africa.

> We should also introduce continents, because finding and adding all countries in a continent may be error-prone and time-consuming.

Haha, yeah, in this case I just filtered this list to Africa but also considered just grabbing a large sample of continent/countries from the page view logs and doing the same. Continents would definitely be cleaner and less error-prone but I would have to think about how often they would actually be used. Including the "NOT" selector probably makes continents less necessary too.
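To make the "NOT" selector idea concrete, here is a rough sketch of how audience matching with an exclusion list could behave. Nothing here exists in QuickSurveys as far as this task shows: the "excludeCountries" key and the function name are hypothetical, and the country list is shortened for illustration.

```python
def audience_matches(country_code, audience):
    """Decide whether a reader's country falls inside a survey audience."""
    # Hypothetical "NOT" selector: countries that must never see the survey.
    if country_code in audience.get("excludeCountries", []):
        return False
    # Existing-style selector: if a country list is given, the reader must be in it.
    included = audience.get("countries")
    return included is None or country_code in included


AFRICA_SAMPLE = ["NG", "KE", "ZA"]  # shortened list, for illustration only

worldwide = {"excludeCountries": AFRICA_SAMPLE}  # everyone except the listed countries
africa = {"countries": AFRICA_SAMPLE}            # only the listed countries

print(audience_matches("NG", worldwide), audience_matches("NG", africa))  # False True
print(audience_matches("DE", worldwide), audience_matches("DE", africa))  # True False
```

With the two audiences defined this way, a reader would be eligible for exactly one of the two surveys, so the repeat-survey concern from the deployment notes would not arise.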

Change 519216 merged by Urbanecm:
[operations/mediawiki-config@master] Undeploy reader demographics surveys

https://gerrit.wikimedia.org/r/519216

Mentioned in SAL (#wikimedia-operations) [2019-07-03T16:09:15Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:519216|Undeploy reader demographics surveys]] (T226273) (duration: 00m 49s)

Surveys finished deployment. Description updated with final counts of survey responses for each language edition. Note that these counts include responses from readers under 18, which made up between 10% (German) and 35% (Hebrew) of all survey responses. Respondents under 18 did not respond to other questions, though, so they will not be included in any other analyses.