Prepare survey data for analysis. This involves cleaning, joining in additional features, and running debiasing code per: https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour/Code#Stage_Two:_Pre-Processing_the_Surveys
- See reader behavior features under T228285 for features that were used in debiasing.
- A GradientBoostingClassifier performed best at making the average features (e.g., average pages viewed per session) for the survey respondents match those of the general population for the wiki, though LogisticRegression also worked quite well in many cases.
- The Wikidata instance-of feature ended up being relatively uninformative, so I might revisit it using drafttopic categories.
- As part of this work, a few changes were required:
  - The African and Worldwide surveys (English/French) were separated because I realized that weights from debiasing would not be comparable if they came from two separate models (when a single model was used for English or French, country was a very strong predictor of whether someone took the survey).
  - I trimmed the control sessions to exactly match the survey session timespan because the survey launched and ended mid-day; without that careful control, day of week became a strong predictor of whether someone took the survey.
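
For reference, the trimming and debiasing steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual pre-processing code linked at the top: the function names, the `timestamp` key, and the choice of normalized inverse-propensity weights (1/p) are all assumptions made for the example. The `model` argument is any classifier with scikit-learn-style `fit` and `predict_proba` methods.

```python
def trim_to_survey_window(control_sessions, survey_sessions, ts_key="timestamp"):
    """Keep only control sessions that fall inside the survey's timespan.

    Sessions are dicts; `ts_key` (hypothetical name) holds a comparable
    timestamp such as a datetime. The window runs from the first to the
    last survey session, so partial launch/end days are cut to match.
    """
    start = min(s[ts_key] for s in survey_sessions)
    end = max(s[ts_key] for s in survey_sessions)
    return [c for c in control_sessions if start <= c[ts_key] <= end]


def inverse_propensity_weights(model, survey_X, control_X):
    """Weight each survey respondent by 1 / P(took survey | features).

    Assumption: inverse-propensity weighting, normalized to mean 1.
    `survey_X` and `control_X` are lists of feature rows.
    """
    X = list(survey_X) + list(control_X)
    y = [1] * len(survey_X) + [0] * len(control_X)  # 1 = survey, 0 = control
    model.fit(X, y)
    # With 0/1 labels, column 1 of predict_proba is P(label == 1).
    propensities = [row[1] for row in model.predict_proba(list(survey_X))]
    # Floor the propensity so a near-zero estimate cannot produce a huge weight.
    raw = [1.0 / max(p, 1e-6) for p in propensities]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


def weighted_mean(values, weights):
    """Weighted average, e.g., of pages viewed per session."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

In use, `model` could be `sklearn.ensemble.GradientBoostingClassifier()` or `sklearn.linear_model.LogisticRegression()`. One way to compare models is to check, for each feature, how close `weighted_mean` over the survey respondents comes to the plain mean over the trimmed control sessions.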