As with previous editions, the 2019 CE Insights survey will need help selecting a sample of currently active contributors.
We will need to do the following:
- Update last year's code with the following changes:
  - Combine the two highest edit-count bins (1200-3499 and 3500+) into a single 1200+ bin.
  - Pull 450 users from a new South Asian language stratum (containing Hindi, Punjabi, Malayalam, Maithili, Gujarati, Tamil, Urdu). Project group stratum labeled sasia_wps.
  - Pull 150 users from a new Vietnamese language stratum. Project group stratum labeled vi_wiki.
  - Pull 150 users from a new Malay language stratum (containing Malay and Indonesian). Project group stratum labeled malay_wps.
  - Pull 150 users from a new Korean language stratum. Project group stratum labeled ko_wiki.
  - Use the combination of the "2019 contributor opt-outs" tab from this spreadsheet and the "Dashboard Leaders by home wiki 2018-2019" tab of this spreadsheet as our opt-out list.
  - Add the Programs and Events Dashboard users, with email if available (from the "Dashboard Leaders by home wiki 2018-2019" tab of this spreadsheet), to the sample table. Project group stratum labeled pe_dashboard_users.
  - Use a fixed seed value for the sampling (e.g. the random_state parameter of pandas.DataFrame.sample()) so that, if necessary, we can re-run it with changes while still selecting the same users.
- Run the code to provide the users sampled for each stratum
- Look up each user's registered email address (this will require using the MediaWiki replicas; email addresses are available both centrally in the centralauth.globaluser table and locally in each wiki's user table, but it's not clear if one of those places is more reliable or more up to date)
- Provide Learning and Evaluation with a list of the sampled users, with the following data for each:
  - user name
  - home wiki
  - registered email address if found (note this makes the dataset [sensitive])
  - email verification date if found
  - edit count stratum
  - wiki stratum
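
The bin merge, opt-out exclusion, and seeded per-stratum draw described above could look roughly like the sketch below. It is not last year's code: the column names (user_name, wiki_stratum, edit_bin), the function names, and the seed value are all assumptions made for illustration. Only the stratum labels, sample sizes, bin labels, and the use of random_state come from this task.

```python
import pandas as pd

SEED = 2019  # hypothetical seed; any fixed integer gives a repeatable draw

# Sample sizes for the new language strata, as listed in this task.
STRATUM_SIZES = {"sasia_wps": 450, "vi_wiki": 150, "malay_wps": 150, "ko_wiki": 150}

# The two highest edit bins collapse into a single 1200+ bin.
BIN_MERGE = {"1200-3499": "1200+", "3500+": "1200+"}


def merge_top_bins(users: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the two highest edit bins relabeled as 1200+."""
    out = users.copy()
    out["edit_bin"] = out["edit_bin"].replace(BIN_MERGE)
    return out


def sample_strata(users, opt_outs, sizes=STRATUM_SIZES, seed=SEED):
    """Draw a seeded sample from each wiki stratum, skipping opted-out users."""
    eligible = users[~users["user_name"].isin(list(opt_outs))]
    parts = []
    for stratum, n in sizes.items():
        # Sort first so the draw does not depend on the incoming row order.
        pool = eligible[eligible["wiki_stratum"] == stratum].sort_values("user_name")
        if pool.empty:
            continue
        # Take the whole pool when it is smaller than the requested size.
        parts.append(pool.sample(n=min(n, len(pool)), random_state=seed))
    if not parts:
        return users.iloc[0:0]
    return pd.concat(parts, ignore_index=True)
```

A fixed seed reproduces the same users only while the eligible pool for a stratum is unchanged. If the opt-out list or the stratum membership changes between runs, the draw for that stratum can change too, so it is worth saving the sampled user names alongside the code.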