
[REQUEST] Sample of Active Users for Community Insights Survey
Closed, Resolved · Public · Aug 14 2020

Description

What's requested: Global Data & Insights needs help drawing its annual sample of active users! We have a 2-step request: first, provide the number of active editors in each home-wiki/editing-activity bin; then pull a sample of active users. We can provide the sample size for each bin once we have the active-editor counts, since we need those counts to calculate it. I'll include more detail below.

Why it's requested: So that we can continue to monitor community health and demographics with the Community Insights survey.

When it's requested: We would like to have the final sample available by August 14.

Other helpful information:

The previous years' requests and code are accessible here and here.

As with previous editions, the 2019 CE Insights survey will need help selecting a sample of currently active contributors.

We will need the following:

  • Population count of active editors by edit bin and home wiki (an example of the breakout from 2019 is here)
    • The population count includes users who were active editors in at least 2 of the past 3 months: registered, non-bot editors with at least 5 content edits across all Wikimedia content projects.
    • Home wiki is defined as the project where the user has made the most edits during the past year.
    • The five edit bins are [10, 30) [30, 150) [150, 600) [600, 1200) and [1200, 2500000)
    • Home wiki categories are mostly the same as 2019 with a few changes bolded below:
      • arwiki
      • asia_wps (the following language projects should be moved from asia_wps to sasia_wps: as, bh, bn, dty, kn, mr, ne, or, pnb, ps, sat, sa, sd, si, tcy, te)
      • cee_wps
      • commons
      • dewiki
      • enwiki
      • eswiki
      • frwiki
      • itwiki
      • jawiki
      • kowiki
      • malay_wps
      • meaf_wps
      • metawiki (this is a new home wiki category)
      • nlwiki
      • other
      • ptwiki
      • ruwiki
      • sasia_wps (the following language projects should be moved from asia_wps to sasia_wps: as, bh, bn, dty, kn, mr, ne, or, pnb, ps, sat, sa, sd, si, tcy, te)
      • viwiki
      • weur_wps
      • wikidata
      • zhwiki
      • pe_dashboard_users
  • With the population data, we can quickly calculate the number of users we'll need to sample. We'll then need users pulled for the sample. This year we can provide the number needed for each combination of home-wiki/editing-activity. The other criteria will be to include only users who (1) have provided an email address, and (2) have EmailThisUser enabled.
  • Use the combination of the "2019 contributor opt-outs" tab from this spreadsheet and the "Dashboard Leaders by home wiki 2018-2019" tab of this spreadsheet as our opt-out list.
  • Add the Programs and Events Dashboard users, with their email addresses where available (from the "Dashboard Leaders by home wiki 2018-2019" tab of this spreadsheet), to the sample table. Label their project group stratum pe_dashboard_users.
  • Use a fixed seed value for the sampling (e.g. the random_state parameter of pandas.DataFrame.sample()) so that, if necessary, we can re-run it with changes while still selecting the same users
  • Run the code to provide the users sampled for each stratum
  • Look up each user's registered email address (this will require using the MediaWiki replicas; email addresses are available both centrally in the centralauth.globaluser table and locally in each wiki's user table, but it's not clear if one of those places is more reliable or more up to date)
  • Provide Global Data & Insights with a list of the sampled users, with the following data for each:
    • user name
    • home wiki
    • registered email address (note this makes the dataset [sensitive])
    • email verification date if found
    • edit count stratum
    • wiki stratum
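
To make the binning and the fixed-seed sampling above concrete, here is a minimal sketch in pandas. It is not the code from previous years: the column names (user_name, wiki_group, edit_count, has_email, email_enabled), the function names, and the seed value are all assumptions made for illustration, and it presumes the active-editor filter and the home wiki grouping have already been applied upstream.

```python
import pandas as pd

# Bin edges taken from the request; every interval is closed on the left.
BIN_EDGES = [10, 30, 150, 600, 1200, 2_500_000]
BIN_LABELS = ["[10, 30)", "[30, 150)", "[150, 600)", "[600, 1200)", "[1200, 2500000)"]


def add_edit_bin(editors: pd.DataFrame) -> pd.DataFrame:
    """Attach an edit-count stratum; editors outside every bin are dropped."""
    bins = pd.cut(editors["edit_count"], bins=BIN_EDGES, labels=BIN_LABELS, right=False)
    out = editors.assign(edit_bin=bins).dropna(subset=["edit_bin"])
    # Plain strings are easier to compare and export than categoricals.
    return out.assign(edit_bin=out["edit_bin"].astype(str))


def population_counts(editors: pd.DataFrame) -> pd.DataFrame:
    """Count editors in each (wiki stratum, edit bin) combination."""
    return (
        editors.groupby(["wiki_group", "edit_bin"])
        .size()
        .rename("population")
        .reset_index()
    )


def draw_sample(editors, sizes, opt_outs, seed=2020):
    """Draw a seeded sample per stratum.

    sizes maps (wiki_group, edit_bin) to the requested sample size.
    opt_outs is a collection of user names to exclude.
    """
    eligible = editors[
        editors["has_email"]
        & editors["email_enabled"]
        & ~editors["user_name"].isin(list(opt_outs))
    ]
    parts = []
    for (wiki_group, edit_bin), n in sizes.items():
        stratum = eligible[
            (eligible["wiki_group"] == wiki_group) & (eligible["edit_bin"] == edit_bin)
        ]
        # Sort first so the draw does not depend on the input row order.
        stratum = stratum.sort_values("user_name")
        # Never ask for more users than the stratum contains.
        parts.append(stratum.sample(n=min(n, len(stratum)), random_state=seed))
    if not parts:
        return eligible.iloc[0:0]
    return pd.concat(parts, ignore_index=True)
```

Sorting each stratum before calling sample() is a design choice: with a fixed random_state, pandas picks the same row positions every time, so a stable row order is what makes a re-run select the same users.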

Thank you! Please let us know if there's any other info you need.

Details

Due Date
Aug 14 2020, 5:00 AM

Event Timeline

Rmaung triaged this task as High priority. (Jul 21 2020, 8:35 PM)
Rmaung created this task.
nettrom_WMF renamed this task from "[REQUEST]" to "[REQUEST] Sample of Active Users for Community Insights Survey". (Jul 22 2020, 5:59 PM)

@Rmaung: I renamed the task so that the title summarizes what's requested. Feel free to edit it again if I missed something. The Product Analytics team will triage this request in our next board refinement meeting, which is on July 28. Let us know if we need to take action sooner.

Hi @nshahquinn-wmf! Please let me know if there's any more information you need from me. I can turn around the sample size request for each group as soon as we have population sizes. I'm unsure of the timeframe for the actual sample pull once I get you the final sample numbers, so I just want to make sure you have everything you need from me!

@Rmaung thanks for checking in! I think you've provided all the information I need. I did the sampling the year before last, and consulted with Megan on it last year, so I'm not expecting any surprises :)

I will work on finishing the wiki categorization changes and pulling the population counts tomorrow or Monday. Once you've decided on the sample numbers, I should be able to finish the sample within 2-3 days.

Let me know if you have any concerns!

@nshahquinn-wmf, awesome! I will look out for the population counts. Thanks so much for the update!

I delivered the population counts to Becky today. She should have the requested sample sizes for me by tomorrow, and hopefully I'll be able to finish the sampling by the end of the week.

Just delivered the sample to Becky! I'm going to spend a little more time on this to make sure that the code is clear and well-documented for next year.

The repo on GitHub is now up-to-date and nicely documented!