What's requested: Global Data & Insights needs help drawing its annual sample of active users! This is a 2-step request: first, provide the number of active editors in each home-wiki/editing-activity bin; then pull a sample of active users. We can provide the number to sample from each bin once we have the active editor counts, since we need those counts to calculate the sample size. More detail is included below.
Why it's requested: So that we can continue to monitor community health and demographics with the Community Insights survey.
When it's requested: We would like to have the final sample available by August 14.
Other helpful information:
The previous years' requests and code are accessible here and here.
As with previous editions, the 2019 CE Insights survey will need help selecting a sample of currently active contributors.
We will need the following:
- Population count of active editors by edit bin and home wiki (an example of the breakout from 2019 is here)
- The population count includes users who were active editors in at least 2 of the past 3 months: registered, non-bot editors with at least 5 content edits across all Wikimedia content projects.
- Home wiki is defined as the project where the user has made the most edits during the past year.
- The five edit bins are [10, 30) [30, 150) [150, 600) [600, 1200) and [1200, 2500000)
- Home wiki categories are mostly the same as 2019, with a few changes noted in parentheses below:
- arwiki
- asia_wps (the following language projects should be moved from asia_wps to sasia_wps: as, bh, bn, dty, kn, mr, ne, or, pnb, ps, sat, sa, sd, si, tcy, te)
- cee_wps
- commons
- dewiki
- enwiki
- eswiki
- frwiki
- itwiki
- jawiki
- kowiki
- malay_wps
- meaf_wps
- metawiki (this is a new home wiki category)
- nlwiki
- other
- ptwiki
- ruwiki
- sasia_wps (the following language projects should be moved from asia_wps to sasia_wps: as, bh, bn, dty, kn, mr, ne, or, pnb, ps, sat, sa, sd, si, tcy, te)
- viwiki
- weur_wps
- wikidata
- zhwiki
- pe_dashboard_users
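In case it helps, here is a minimal sketch of how the population count by edit bin and home wiki category could be produced with pandas. It assumes a DataFrame with one row per active editor; the column names `user_name`, `wiki_stratum`, and `edit_count`, and the function name `population_counts`, are hypothetical and not taken from the previous years' code. The time window used for `edit_count` is not specified here, so that is left to the query that builds the input.

```python
import pandas as pd

# Bin edges from the request; every interval is closed on the left, open on the right.
BIN_EDGES = [10, 30, 150, 600, 1200, 2500000]
BIN_LABELS = ["[10, 30)", "[30, 150)", "[150, 600)", "[600, 1200)", "[1200, 2500000)"]


def population_counts(editors: pd.DataFrame) -> pd.DataFrame:
    """Count active editors per (wiki_stratum, edit_bin).

    Assumed (hypothetical) input columns: user_name, wiki_stratum, edit_count.
    """
    binned = editors.assign(
        # right=False makes each bin include its lower edge and exclude its upper edge.
        edit_bin=pd.cut(
            editors["edit_count"], bins=BIN_EDGES, labels=BIN_LABELS, right=False
        )
    )
    # Editors outside every bin (for example, fewer than 10 edits) get NaN and are dropped.
    binned = binned.dropna(subset=["edit_bin"])
    return (
        binned.groupby(["wiki_stratum", "edit_bin"], observed=True)
        .size()
        .rename("population")
        .reset_index()
    )
```

Only combinations that actually occur appear in the output, so empty strata would need to be filled in with zeros separately if the breakout table should list every combination.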
- With the population data, we can quickly calculate the number of users we'll need to sample. We'll then need users pulled for the sample. This year we can provide the number needed for each combination of home-wiki/editing-activity. The sample should include only users who (1) have provided an email address, and (2) have EmailThisUser enabled.
- Use the combination of the "2019 contributor opt-outs" tab from this spreadsheet and the "Dashboard Leaders by home wiki 2018-2019" tab of this spreadsheet as our opt-out list.
- Add the Programs and Events Dashboard users, with their email addresses if available (from the "Dashboard Leaders by home wiki 2018-2019" tab of this spreadsheet), to the sample table. Their project group stratum should be labeled pe_dashboard_users.
- Use a fixed seed value for the sampling (e.g. the random_state parameter of pandas.DataFrame.sample()) so that, if necessary, we can re-run it with changes while still selecting the same users
- Run the code to provide the users sampled for each stratum
- Look up each user's registered email address (this will require using the MediaWiki replicas; email addresses are available both centrally in the centralauth.globaluser table and locally in each wiki's user table, but it's not clear if one of those places is more reliable or more up to date)
- Provide Global Data & Insights with a list of the sampled users, with the following data for each:
- user name
- home wiki
- registered email address (note this makes the dataset [sensitive])
- email verification date if found
- edit count stratum
- wiki stratum
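To illustrate the sampling steps above, here is a rough sketch of a seeded, per-stratum draw with pandas. It is only an outline under stated assumptions: the column names (`user_name`, `wiki_stratum`, `edit_bin`, `has_email`, `email_allowed`), the `targets` mapping, the `draw_sample` function, and the seed value are all hypothetical, and the email lookup against the replicas is not shown.

```python
import pandas as pd

SEED = 12345  # arbitrary constant (an assumption); keep it unchanged between re-runs


def draw_sample(eligible, targets, opt_outs, seed=SEED):
    """Draw a fixed-seed sample from each (wiki_stratum, edit_bin) stratum.

    eligible: DataFrame with hypothetical columns user_name, wiki_stratum,
              edit_bin, has_email (bool), email_allowed (bool).
    targets:  dict mapping (wiki_stratum, edit_bin) to the number of users wanted.
    opt_outs: iterable of user names to exclude.
    """
    pool = eligible[
        eligible["has_email"]
        & eligible["email_allowed"]
        & ~eligible["user_name"].isin(list(opt_outs))
    ]
    # Sort first so the draw does not depend on the row order returned by the query.
    pool = pool.sort_values("user_name")

    samples = []
    for (wiki, edit_bin), group in pool.groupby(["wiki_stratum", "edit_bin"], observed=True):
        # Never ask for more users than the stratum contains.
        n = min(targets.get((wiki, edit_bin), 0), len(group))
        if n > 0:
            samples.append(group.sample(n=n, random_state=seed))

    if not samples:
        return pool.iloc[0:0]  # empty frame with the same columns
    return pd.concat(samples, ignore_index=True)
```

With the same seed and the same eligible pool, a re-run selects the same users. If the pool itself changes (for example, a new opt-out in a stratum), the selection within that stratum may shift, so it is worth diffing the output against the previous run.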
Thank you! Please let us know if there's any other info you need.