
Create active editor sample list for the 2019 Community Engagements Insights survey
Closed, Resolved · Public · Aug 30 2019

Description

As with previous editions, the 2019 CE Insights survey will need help selecting a sample of currently active contributors.

We will need to do the following:

  • Update last year's code with the following changes:
    • Combine the two highest edit-count bins (1200-3499 and 3500+) into a single 1200+ bin.
    • Pull 450 users from a new South Asian language stratum (containing Hindi, Punjabi, Malayalam, Maithili, Gujarati, Tamil, Urdu). Project group stratum labeled sasia_wps.
    • Pull 150 users from a new Vietnamese language stratum. Project group stratum labeled vi_wiki.
    • Pull 150 users from a new Malay language stratum (containing Malay and Indonesian). Project group stratum labeled malay_wps.
    • Pull 150 users from a new Korean language stratum. Project group stratum labeled ko_wiki.
    • Use the combination of the "2019 contributor opt-outs" tab from this spreadsheet and the "Dashboard Leaders by home wiki 2018-2019" tab of this spreadsheet as our opt-out list.
    • Add the Programs and Events Dashboard users and email if available (from the "Dashboard Leaders by home wiki 2018-2019" tab of this spreadsheet) to sample table. Project group stratum labeled pe_dashboard_users.
    • Use a fixed seed value for the sampling (e.g. the random_state parameter of pandas.DataFrame.sample()) so that, if necessary, we can re-run it with changes while still selecting the same users
  • Run the code to provide the users sampled for each stratum
  • Look up each user's registered email address (this will require using the MediaWiki replicas; email addresses are available both centrally in the centralauth.globaluser table and locally in each wiki's user table, but it's not clear if one of those places is more reliable or more up to date)
  • Provide Learning and Evaluation with a list of the sampled users, with the following data for each:
    • user name
    • home wiki
    • registered email address if found (note this makes the dataset [sensitive])
    • email verification date if found
    • edit count stratum
    • wiki stratum
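
A minimal sketch of how the sampling changes above could fit together, assuming a pandas population frame. The column names, the lower bin edges, the toy data, and the `draw_sample` helper are hypothetical and are not taken from the actual code repo; only the 1200+ top bin, the opt-out exclusion, and the fixed `random_state` come from the task description.

```python
import pandas as pd

# Hypothetical population frame; column names and values are illustrative only.
population = pd.DataFrame({
    "user_name": [f"User{i}" for i in range(12)],
    "wiki_stratum": ["vi_wiki"] * 6 + ["ko_wiki"] * 6,
    "edit_count": [5, 40, 300, 1200, 2000, 5000, 12, 90, 700, 1500, 3499, 3500],
})
opt_outs = {"User3", "User7"}  # stand-in for the combined opt-out list

# Only the 1200 boundary comes from the task; the lower edges are placeholders.
BIN_EDGES = [0, 100, 1200, float("inf")]
BIN_LABELS = ["0-99", "100-1199", "1200+"]


def draw_sample(frame, opt_outs, per_stratum, seed=2019):
    """Drop opt-outs, bin by edit count, then sample each stratum reproducibly."""
    eligible = frame[~frame["user_name"].isin(opt_outs)].copy()
    # right=False makes each bin include its lower edge, so 1200 lands in "1200+".
    eligible["edit_stratum"] = pd.cut(
        eligible["edit_count"], bins=BIN_EDGES, labels=BIN_LABELS, right=False
    )
    parts = []
    for _, group in eligible.groupby(["wiki_stratum", "edit_stratum"], observed=True):
        n = min(per_stratum, len(group))  # never ask for more users than exist
        # A fixed random_state means a re-run selects the same users.
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).sort_index()


sample = draw_sample(population, opt_outs, per_stratum=1)
```

Because the seed is fixed, calling `draw_sample` again on the same population returns the same rows, which is the property the last bullet of the update list asks for.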

Details

Due Date
Aug 30 2019, 1:30 PM

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 13 2018, 8:57 PM
nshahquinn-wmf triaged this task as Medium priority. · Mar 22 2019, 1:04 AM
nshahquinn-wmf updated the task description.
nshahquinn-wmf moved this task from Next Up to Epics on the Product-Analytics board.
nshahquinn-wmf removed a subscriber: egalvezwmf.
JAnstee_WMF added a comment. · Edited · Apr 20 2019, 3:42 PM

It seems that the adjusted timeline is looking like starting in June/July:

  • Updates to the population & sampling queries, if any (June/July)
  • Query the sample lists (July)
  • Post messages to talk pages in August and September.

So your work would start in June if you are available. We do not know yet what the updates to the queries will be; there may not be any. Let me know if this will work for you or if you have any questions.

Yes, this seems totally doable! It's too early to know what other work I will have during that time, but given the relative looseness of the timeline and the fact I wrote and successfully used the code last year, I'm not worried.

Update on sample frame changes for 2019:
After reviewing the sample frame in terms of alignment to spaces identified for potential growth in the MTP, it seems we will increase our sampling of contributors in India and Asia by deepening the sample for:

  • Indic languages, to increase the target by 450 via inclusion of 150 Hindi, 150 Punjabi, and 30 each from Malayalam, Maithili, Gujarati, Tamil, and Urdu (150 total)
  • Asian languages beyond Japanese and Chinese to include:
    • Vietnamese (150)
    • Indonesian (150)
    • Korean (150)

Other than these additions the sampling of contributors will be consistent with last year, pulling stratified samples within each to capture a span of activity levels.

see: https://docs.google.com/spreadsheets/d/1UdX0VrS7LhyspGuGthgqluY4rjnT1z1XlbUej79G9Qc/edit#gid=311219244

nshahquinn-wmf added a subscriber: kzimmerman. · Edited · Jul 3 2019, 2:11 PM

@JAnstee_WMF, thanks for the ping!

We discussed this task in a team meeting yesterday, and we decided that, because of my responsibilities to other teams, it would be better for another analyst on the team to take this on (with support from me). @kzimmerman is on vacation this week, but will discuss this with the other analysts next week and decide who it will be.

The last we said was that you wanted to get the sample lists pulled in July, and then send the messages to the users on those lists during August and September. Is that still accurate? We should be able to meet that timeline, but since we're actually pulling for a data source that updates at the end of each calendar month, it might be better to wait until August to make the lists so we can base it on May–July rather than April–June.

The changes you want to make are feasible, but there's one problem: in addition to stratifying users by their home wiki (as a proxy for language), we stratify them by edit count over the past year. That means that each additional language group we add means that we have six additional strata. In itself, this isn't a problem—the code could deal with hundreds of extra strata if you wanted.

The problem is that we want to make it impossible to reidentify respondents from their answers, so we can't have any strata where there are so few users in the population that all of them would be sampled. So we don't survey any strata where there are fewer than 20 editors in the population, but if we have separate language groups for small languages like Gujarati, probably none of the 6 Gujarati strata would get surveyed.

So we should probably group all the Indic languages or at least the five smaller ones into one larger language group.
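
To illustrate the point about small strata, here is a rough sketch of the minimum-population rule, assuming a pandas frame with one row per user in the population. The `surveyable_strata` helper, the column names, the `gu_wiki` label, and the toy counts are hypothetical; only the threshold of 20 editors comes from the comment above.

```python
import pandas as pd

MIN_POPULATION = 20  # strata with fewer editors than this are not surveyed


def surveyable_strata(population, keys=("wiki_stratum", "edit_stratum")):
    """Return the strata whose population meets the minimum size."""
    sizes = (
        population.groupby(list(keys), observed=True)
        .size()
        .rename("population")
    )
    return sizes[sizes >= MIN_POPULATION].reset_index()


# Toy data: a standalone Gujarati group is too small in this bin,
# while a combined South Asian group clears the threshold.
toy = pd.DataFrame({
    "wiki_stratum": ["gu_wiki"] * 5 + ["sasia_wps"] * 25,
    "edit_stratum": ["1200+"] * 30,
})
kept = surveyable_strata(toy)
```

In this toy case only the combined `sasia_wps` stratum survives the filter, which is the argument for grouping the smaller Indic languages together.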

I've been looking back at my code from the last survey and remembering all the complexity, so once we know which analyst will be running it this time, the three of us should probably have a meeting to discuss the process and nail down this question about the sampling frame 😁

nshahquinn-wmf renamed this task from Update CE Insights query and bot to Support the 2019 Community Engagements Insights survey. · Jul 3 2019, 3:47 PM
nshahquinn-wmf removed nshahquinn-wmf as the assignee of this task.
nshahquinn-wmf claimed this task.
nshahquinn-wmf removed nshahquinn-wmf as the assignee of this task.
nshahquinn-wmf moved this task from Epics to Triage on the Product-Analytics board.
nshahquinn-wmf updated the task description.
nshahquinn-wmf added a subscriber: nshahquinn-wmf.

Sounds good, Neil. Pulling the data in August should also work just fine. Thanks for the details on the Indic languages too, I was wondering about that exact thing, glad to hear it should work to group them and just increase our sampling there. Also glad to hear you have refreshed yourself of the details as I remember it is quite complex as well. Happy to jump on a call once y'all have responsibilities sorted! Thanks!

kzimmerman moved this task from Triage to Backlog on the Product-Analytics board.
kzimmerman added a subscriber: MNeisler.

Assigning to Megan, who will coordinate with @Neil_P._Quinn_WMF on knowledge transfer. The plan for now will be to pull data in August.

@JAnstee_WMF we're stretched a bit thin at the moment and might need to scale back some of the work Neil had previously done (which would mean more manual work on your side); @MNeisler will set up time to revisit scope with you, Neil, and me.

Rmaung added a subscriber: Rmaung. · Jul 16 2019, 5:29 PM
kzimmerman set Due Date to Aug 16 2019, 1:00 AM.
Restricted Application changed the subtype of this task from "Task" to "Deadline". · Jul 16 2019, 10:14 PM
nshahquinn-wmf renamed this task from Support the 2019 Community Engagements Insights survey to Create active editor sample list for the 2019 Community Engagements Insights survey. · Aug 14 2019, 9:54 AM
nshahquinn-wmf updated the task description.
Restricted Application added a subscriber: revi. · Aug 14 2019, 9:54 AM
nshahquinn-wmf changed Due Date from Aug 16 2019, 1:00 AM to Aug 30 2019, 1:30 PM. · Aug 14 2019, 9:54 AM
kzimmerman moved this task from Next Up to Doing on the Product-Analytics board. · Aug 20 2019, 9:08 PM
MNeisler updated the task description. · Aug 23 2019, 12:23 AM
MNeisler updated the task description. · Aug 27 2019, 6:23 PM

I finished pulling the active editor sample with the specified changes. Updated code repo.

@JAnstee_WMF - Please see the email with the current list of the sampled users for this year's survey. Let me know if you have any questions.

MNeisler updated the task description. · Aug 27 2019, 11:42 PM
Rmaung added a comment. · Edited · Aug 27 2019, 11:42 PM

Nevermind, just saw your message! @MNeisler could you copy me on that email too?

@Rmaung - Sure! You were copied. Let me know if you have any issues accessing the doc.

kzimmerman closed this task as Resolved. · Sep 10 2019, 9:31 PM