
Growth: update welcome survey aggregation schedule
Closed, Resolved, Public

Description

In T275171: Growth: shorten welcome survey retention to 90 days, we are going to shorten the welcome survey's data retention window to 90 days. That means we may need to alter the code that stores aggregates of welcome survey data so that data isn't lost. The original aggregation was done in T235548: Welcome survey: store aggregates.

To unblock T275171: Growth: shorten welcome survey retention to 90 days, we first need to do a manual pull of all available data for all wikis. Once that pull is done, T275171 can start.

Event Timeline

Milimetric triaged this task as High priority.
Milimetric added a project: Analytics-Kanban.
Milimetric moved this task from Incoming to Security Maturity and Data Privacy on the Analytics board.
nettrom_WMF lowered the priority of this task from High to Medium.
nettrom_WMF edited projects, added Analytics-Radar; removed Analytics-Kanban, Analytics.
nettrom_WMF added a subscriber: mforns.

Moving this to the Analytics radar and reassigning to me. The Welcome Survey aggregation needs to be streamlined and set up for monthly updates, and this task is for me to get that done. I also changed the parent task to T275171, which is the task for the Growth engineers to pick up when the aggregation is done and we can safely shorten the window.

Oh, we groomed this task and assigned it to me by mistake, thanks for fixing :]

@mforns : Sure thing! Sorry about the confusion. Once I noticed that the parent task was set to the one I created to summarize all the EventLogging-related work, it made sense why this would look like another task for someone from AE to work on.

The first part of this work is now completed, and @Tgr can go ahead with T275171 and shorten the survey retention.

@MMiller_WMF : While working on this, I found that the last data pull was in March 2020, and it captured complete data through the end of February 2020. Because of the 270-day retention, we've lost March through June 2020 and have only partial data for July 2020.
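To make the retention arithmetic concrete, here is a minimal sketch of how the lost and partial months fall out of a 270-day window. The pull date used below (2021-04-10) is a hypothetical value chosen for illustration, since the actual date of the pull is not stated here. The function names and the assumption that everything older than the cutoff has been purged are also mine.

```python
from datetime import date, timedelta


def next_month_start(d):
    """Return the first day of the month following the month of `d`."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def classify_months(last_aggregated_day, pull_date, retention_days):
    """Split the months after the last aggregation into lost and partial.

    Assumption: rows older than `pull_date - retention_days` have been purged.
    """
    cutoff = pull_date - timedelta(days=retention_days)
    lost, partial = [], []
    month_start = next_month_start(last_aggregated_day)
    while month_start < cutoff:
        month_end = next_month_start(month_start)
        if month_end <= cutoff:
            # The whole month is older than the cutoff.
            lost.append((month_start.year, month_start.month))
        else:
            # The cutoff falls inside this month.
            partial.append((month_start.year, month_start.month))
        month_start = month_end
    return lost, partial, cutoff


# Hypothetical pull date; the last aggregation covered data through 2020-02-29.
lost, partial, cutoff = classify_months(date(2020, 2, 29), date(2021, 4, 10), 270)
print("cutoff:", cutoff)
print("lost months:", lost)
print("partial months:", partial)
```

With that hypothetical pull date, the cutoff lands in mid-July 2020, which reproduces the situation described above: March through June are entirely gone and July is partial.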

There are a few smaller tasks remaining to wrap up this work:

  1. Add the previous aggregated question responses to the new aggregate table in the data lake.
  2. Remove the mentor question from the notebook, because it's no longer in the survey.
  3. Set up a cron job (or something similar) to run the notebook and add aggregate data to the data lake once every month.
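As a rough illustration of what a monthly run could do, the sketch below picks the previous calendar month and counts responses per wiki and question. It is not the code from the notebook. The row layout, wiki names, question key, and response values are all hypothetical placeholders.

```python
from collections import Counter
from datetime import date


def previous_month(today):
    """Return (start, end) of the month before `today`; `end` is exclusive."""
    end = today.replace(day=1)
    if end.month == 1:
        start = date(end.year - 1, 12, 1)
    else:
        start = date(end.year, end.month - 1, 1)
    return start, end


def aggregate_responses(rows, start, end):
    """Count responses with start <= day < end.

    `rows` is an iterable of (wiki, day, question, response) tuples,
    a hypothetical layout used only for this sketch.
    """
    counts = Counter()
    month_label = start.strftime("%Y-%m")
    for wiki, day, question, response in rows:
        if start <= day < end:
            counts[(wiki, month_label, question, response)] += 1
    return counts


# Hypothetical sample rows.
rows = [
    ("cswiki", date(2021, 3, 2), "reason", "edit-typo"),
    ("cswiki", date(2021, 3, 15), "reason", "edit-typo"),
    ("kowiki", date(2021, 3, 31), "reason", "new-page"),
    ("cswiki", date(2021, 4, 1), "reason", "edit-typo"),  # outside the month
]
start, end = previous_month(date(2021, 4, 5))
for key, n in sorted(aggregate_responses(rows, start, end).items()):
    print(key, n)
```

Running this early in each month means the previous month is complete at aggregation time, so a scheduled job would not need to revisit partial months.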

Once all of that is complete, it'll mean that 1) if we change the questions in the survey we'll also need to modify the notebook accordingly, and 2) as we scale and add the survey to additional wikis, we'll also need to update the notebook (or add some code that can automatically identify the wikis that have the survey configured).

Change 673631 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Update GrowthExperiments cronjob parameters

https://gerrit.wikimedia.org/r/673631

> The first part of this work is now completed, and @Tgr can go ahead with T275171 and shorten the survey retention.

T275171: Growth: shorten welcome survey retention to 90 days is done now.

This work is 90% done. The notebook is updated and ready for automatic monthly aggregation, and I'll get it up on GitHub later today. Currently our data is scattered across multiple small files and needs compaction to perform well in Hadoop, which I'm looking into. I'm moving this task into "blocked" on the PA board until further progress can be made.

This work is completed and the notebook is on GitHub: https://github.com/nettrom/Growth-welcomesurvey-2018/blob/master/T275172_survey_aggregation.ipynb

It's set up to run monthly, and I'll be monitoring it to see if any additional work is needed.