Prepare demographics survey data for analysis
Closed, Resolved; Public

Description

Prepare survey data for analysis. This involves cleaning, joining in additional features, and running debiasing code per: https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour/Code#Stage_Two:_Pre-Processing_the_Surveys

Event Timeline

Isaac triaged this task as High priority. Dec 20 2018, 6:57 PM
Isaac created this task.
leila closed this task as Declined. Mar 14 2019, 5:37 PM
leila reopened this task as Open.

This is now blocked on addressing the questions that we ran into during the pilot survey under T212444.

Isaac moved this task from Staged to In Progress on the Research board. Jul 1 2019, 4:57 PM
Isaac closed this task as Resolved. Aug 23 2019, 10:21 PM

Debiasing complete.

  • See reader behavior features under T228285 for features that were used in debiasing.
  • It was determined that a GradientBoostingClassifier performed best at making the average features (e.g., average pages viewed per session) for the survey respondents match those of the general population for the wiki, though LogisticRegression also worked quite well in many cases.
  • Wikidata instance-of ended up being relatively uninformative, so I might revisit that with drafttopic categories.
  • As part of this work, a few changes were required:
    • The African and Worldwide surveys (English/French) were separated because I realized that the weights from debiasing would not be comparable if they came from two separate models. (If a single model was used for English or French, country was a very strong predictor of whether someone took the survey.)
    • I trimmed the control sessions to exactly match the survey session timespan. The survey launched and ended mid-day, so without this trimming, day of week became a strong predictor of whether someone took the survey.
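
For readers unfamiliar with this kind of debiasing, the trimming and weighting steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the linked page: the function names, field names, dates, and numbers are all hypothetical, and the propensities are hard-coded toy values standing in for the output of a fitted classifier (for example, the predicted probability that a session belongs to a survey respondent). The odds-style weight (1 - p) / p is one common choice for matching respondents to a control population; the task does not confirm which weighting formula was actually used.

```python
from datetime import datetime


def trim_to_window(sessions, start, end):
    """Keep only sessions that began inside the survey window [start, end]."""
    return [s for s in sessions if start <= s["start_time"] <= end]


def odds_weights(propensities):
    """Turn P(respondent | features) into weights (1 - p) / p, rescaled to mean 1."""
    raw = [(1.0 - p) / p for p in propensities]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_mean(values, weights):
    """Weighted average of a feature across respondents."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical survey window that starts and ends mid-day.
survey_start = datetime(2019, 1, 7, 14, 0)
survey_end = datetime(2019, 1, 14, 14, 0)

# Hypothetical control sessions; the first and last fall outside the window.
control = [
    {"start_time": datetime(2019, 1, 7, 9, 0), "pages": 1},
    {"start_time": datetime(2019, 1, 7, 15, 0), "pages": 1},
    {"start_time": datetime(2019, 1, 10, 12, 0), "pages": 5},
    {"start_time": datetime(2019, 1, 14, 18, 0), "pages": 5},
]
trimmed_control = trim_to_window(control, survey_start, survey_end)

# Toy population: 80 light readers (1 page) and 20 heavy readers (5 pages),
# so the population average is 1.8 pages per session.
# Toy respondents: 2 light and 2 heavy readers, so heavy readers are
# over-represented and the unweighted respondent average is 3.0.
respondent_pages = [1, 1, 5, 5]
# Stand-in propensities: respondents / (respondents + controls) per group.
propensities = [2 / 82, 2 / 82, 2 / 22, 2 / 22]

weights = odds_weights(propensities)
unweighted_avg = sum(respondent_pages) / len(respondent_pages)
weighted_avg = weighted_mean(respondent_pages, weights)
```

In this toy setup the weighted respondent average lines up with the population average, which is the kind of balance check described above for comparing models. With real data the match would only be approximate, and it would depend on how well the classifier estimates the propensities.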