
Welcome survey: store aggregates
Closed, Resolved, Public


As described in the parent task, we can only keep welcome survey responses for one year. We want to store aggregates to help us study how newcomer intentions may change over long periods of time. It is okay if this task is not completed until a few weeks after data has started being deleted.

Perhaps we will want to store them at the monthly level per wiki.


Due Date
Nov 29 2019, 8:00 AM

Event Timeline

@nettrom_WMF -- I did not specify in here the exact data we will want to store, the timeframe, or the format. I think this can be up to you, but I'm happy to weigh in if needed.

MMiller_WMF updated the task description.

@MMiller_WMF : we have a working prototype for storing aggregates. I think the next step should be that we close this task and open a subtask for productionizing it in collaboration with Analytics Engineering.

@nettrom_WMF -- did we also include storing the aggregates on the topics? Also, is it important to productionize it, or should we stick with the prototype going forward?

@MMiller_WMF : I updated the topic aggregation code when I was working on T246822, so that we have counts for every checkbox and autocomplete topic, with any user-entered topic counted as "other". Because some of the data had already been deleted when I ran the updated code, we have topic data only from April to September of last year (for all other questions we have data from December 2018 onwards).
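For anyone reading this later, the counting rule described above can be sketched roughly as follows. This is not the actual aggregation code: the topic names, the `KNOWN_TOPICS` set, and the `count_topics` helper are all made up for illustration, and the real input format may differ.

```python
from collections import Counter

# Hypothetical set of topics offered as checkboxes or through autocomplete.
KNOWN_TOPICS = {"arts", "science", "history"}


def count_topics(responses, known_topics=KNOWN_TOPICS):
    """Count topic selections across users.

    Each element of `responses` is one user's list of selected topics.
    Any topic not in `known_topics` (i.e. entered as free text) is
    counted under "other".
    """
    counts = Counter()
    for topics in responses:
        for topic in topics:
            key = topic if topic in known_topics else "other"
            counts[key] += 1
    return counts


# Made-up example responses from three users.
example = [
    ["arts", "science"],
    ["science", "my local football club"],
    ["trains", "knitting"],
]
topic_counts = count_topics(example)
```

Note that "other" here counts free-text topics rather than users, so one user who enters two free-text topics adds two to the count.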

I'm okay with not productionizing it for now, and we can keep monitoring how much we need updated data on this, particularly as we expand to additional wikis.

@nettrom_WMF -- okay sounds good. Could you please leave a note on this task saying where the aggregates are stored so that we (probably I) can remember where to find them in the future? And then I think it can be resolved. Thank you!

The aggregates are stored in the growth_welcomesurvey database in the Data Lake.

There are five tables in that database, and all tables are split by month, wiki, and the platform the user registered on (desktop/mobile). The names of the questions are taken from the saved JSON data, as are the available options for questions 1, 2, and 4. The options can be mapped to their actual text by inspecting the HTML of the form.

  • monthly_overview: Overview of user groups (e.g. control/survey for wikis where those were used), type of survey response (save/skip/abandon) if in a survey group, and number of users.
  • q1_responses: Responses to the first question on the survey (currently "Why did you create your account today?")
  • q2_responses: Responses to the second question on the survey (currently "Have you ever edited Wikipedia?")
  • q3_responses: Number of users who selected interest in a given topic, for all topics available as checkboxes and through autocomplete. Topic "other" counts the number of topics added through free text. This question is currently not in the survey; it is labelled "q3" for historical reasons.
  • q4_responses: Responses to the last question on the survey (currently "Yes, I’m interested" as the answer to the following description and question "We are considering starting a program for more experienced editors to help newer users with editing. Are you interested in being contacted to get help with editing?") If users do not check the box (which is the default), the value is "False", otherwise it's "True".
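As a rough illustration of how one of these tables might be used, here is a sketch that sums q1 responses per month across platforms. It uses an in-memory SQLite table as a stand-in for the Data Lake, and the column names (`month`, `wiki`, `platform`, `response`, `num_users`), the wiki, the option values, and the numbers are all assumptions, not the actual schema or data.

```python
import sqlite3

# In-memory stand-in for the q1_responses table; the real table lives in
# the growth_welcomesurvey database and its column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE q1_responses (
        month TEXT,       -- assumed format: YYYY-MM
        wiki TEXT,
        platform TEXT,    -- desktop or mobile
        response TEXT,    -- option name as saved in the JSON data
        num_users INTEGER
    )
    """
)

# Made-up rows for illustration only.
rows = [
    ("2019-01", "cswiki", "desktop", "option-a", 10),
    ("2019-01", "cswiki", "mobile", "option-a", 4),
    ("2019-01", "cswiki", "desktop", "option-b", 7),
    ("2019-02", "cswiki", "desktop", "option-a", 12),
]
conn.executemany("INSERT INTO q1_responses VALUES (?, ?, ?, ?, ?)", rows)


def monthly_totals(connection, wiki):
    """Sum users per month and response for one wiki, across platforms."""
    query = """
        SELECT month, response, SUM(num_users)
        FROM q1_responses
        WHERE wiki = ?
        GROUP BY month, response
        ORDER BY month, response
    """
    return connection.execute(query, (wiki,)).fetchall()


totals = monthly_totals(conn, "cswiki")
```

Since every table is split by month, wiki, and platform, the same grouping pattern should carry over to the other tables, with the response column swapped for whatever each table stores.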

Closing this task as resolved, but feel free to reopen it if necessary.

@nettrom_WMF -- thanks. What will happen if we add, remove, or change questions on the survey in the future? For example, if we were to swap questions or change their ordering, or replace the "have you ever edited" question with one asking about languages? Would new columns appear in the Data Lake, or would new data be put in old columns? Or something else?