[Dashboard] Migrate golden to Reportupdater infrastructure
Closed, Resolved · Public · 50 Story Points

Description

We need our golden data retriever codebase to be more modular and to better support adding new modules, which will be very helpful when we start doing ZRR calculation for "well-behaved searches" (see T150370 & T150901). Testing whether a new script works and backfilling missing data are a huge pain right now.

After talking with Analytics, Chelsy and I decided to migrate our codebase to use their Reportupdater infrastructure as it seems to meet our needs. This will require the following steps:

  • Rewrite as many EventLogging (EL) based scripts as possible to be pure SQL
  • Rewrite current pure-R scripts to be shell scripts + R and use Reportupdater conventions
  • Update column names in current datasets
  • Finalize (test the heck out of) Reportupdater-based codebase
  • Code review by @chelsyx
  • Deploy & schedule for daily execution
  • Prepare dashboards for new formats/naming conventions (all in CR)
  • Deploy dashboards after Reportupdater-based refactor of golden has completed at least one successful run
mpopov created this task. Nov 16 2016, 11:49 PM
Restricted Application added a subscriber: Aklapper. Nov 16 2016, 11:49 PM
mpopov triaged this task as "Normal" priority. Nov 16 2016, 11:57 PM
mpopov added a project: Discovery-Analysis.

Change 322051 had a related patch set uploaded (by Bearloga):
Easier backfilling [WIP]

https://gerrit.wikimedia.org/r/322051

mpopov added a comment. Edited Nov 28 2016, 10:32 PM

Side-update: Chelsy and I spoke with Nuria and Dan from Analytics about possibly migrating to their Reportupdater infrastructure, but had the following concerns that we're waiting to hear back on:

  1. Many of the datasets we generate are in long format rather than wide format, with multiple rows for each date. For example, zero results rate on a project-language pair basis, with project as a column, language as a column, and rate as a column -- which yields hundreds of data points for any particular date. We can't force all of our datasets to be wide (which seems to be what Reportupdater requires), especially datasets with ever-changing user agent breakdowns.

Dan: The funnel setting is meant to address reports that can't easily be wide. Bad name, I know, but we use it and it works well.
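
To make the long-versus-wide concern concrete, here is a minimal Python sketch. The rows and rates are made-up illustrative values, `long_to_wide` is a hypothetical helper, and nothing here is Reportupdater's own API or how its funnel setting is implemented.

```python
from collections import defaultdict

# Hypothetical long-format rows: (date, project, language, zero_results_rate).
# The values are invented for illustration only.
long_rows = [
    ("2016-11-27", "wikipedia", "en", 0.21),
    ("2016-11-27", "wikipedia", "de", 0.18),
    ("2016-11-27", "wiktionary", "en", 0.35),
    ("2016-11-28", "wikipedia", "en", 0.22),
    ("2016-11-28", "wikipedia", "de", 0.17),
]


def long_to_wide(rows):
    """Pivot long rows into one dict per date, keyed by 'project.language'."""
    wide = defaultdict(dict)
    for date, project, language, rate in rows:
        wide[date][f"{project}.{language}"] = rate
    return dict(wide)


wide = long_to_wide(long_rows)

# Every distinct pair becomes its own column in wide format.
columns = sorted({col for row in wide.values() for col in row})

for date in sorted(wide):
    # Pairs that did not occur on a given date show up as gaps (None).
    print(date, [wide[date].get(col) for col in columns])
```

In the wide form the set of columns changes whenever a new project-language pair (or user agent) appears, which is why a fixed wide header is awkward for these datasets, while the long form only ever needs the same four columns.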

  2. Some of our metrics require us to keep only a rolling N-day window. One example is the zero results rate broken down by project-language pairs: there are so many of those pairs that keeping hundreds of observations per day would make our dashboards very slow to use. Can Reportupdater be made to recognize when we want to delete rows where date is older than N days?

Dan: Yes, we have a max data points setting that cuts off data after a certain amount is collected. Takeaway for us would be to make sure it works as expected with the funnel reports (by counting distinct dates, not rows).
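
The difference between counting distinct dates and counting rows can be sketched in Python as follows. This is only an illustration under assumptions: `keep_last_n_dates` is a hypothetical helper, the rows are invented, and it does not show how Reportupdater's max data points setting actually works.

```python
def keep_last_n_dates(rows, n):
    """Keep only rows whose date is among the n most recent distinct dates.

    Assumes the date is the first field of each row and is an ISO 8601
    string (YYYY-MM-DD), so lexicographic order matches chronological order.
    """
    if n <= 0:
        return []
    recent = set(sorted({row[0] for row in rows})[-n:])
    return [row for row in rows if row[0] in recent]


# Hypothetical long-format rows with two observations per date.
rows = [
    ("2016-12-01", "wikipedia", "en", 0.20),
    ("2016-12-01", "wikipedia", "de", 0.19),
    ("2016-12-02", "wikipedia", "en", 0.21),
    ("2016-12-02", "wikipedia", "de", 0.18),
    ("2016-12-03", "wikipedia", "en", 0.22),
    ("2016-12-03", "wikipedia", "de", 0.17),
]

by_dates = keep_last_n_dates(rows, 2)  # all rows for the 2 latest dates
by_rows = rows[-2:]                    # naive cut: just the last 2 rows

print(len(by_dates), len(by_rows))
```

With several rows per date, cutting by row count keeps far less history than intended (here a single day instead of two), which is the behavior worth checking for long-format reports.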

  3. You mentioned that the stats user on stat1002 could have a personal library of R packages like Chelsy and I do. If we add a new script/dataset to a reportupdater repo that requires a new package or a new version of an existing package, how simple/fast is the process for installing/updating R packages in the stats user's library? Right now it's super easy and fast for me/us to start using new packages when we add new metrics, since all our scripts run under my account on stat1002.

Dan: The process to install a new R package can vary a lot depending on what it takes to build the deb package (I think). This is better addressed by the opsier folks.

mpopov edited the task description.

Change 322051 abandoned by Bearloga:
Easier backfilling [WIP]

Reason:
Abandoning in favor of migrating codebase to Analytics' Reportupdater infrastructure.

https://gerrit.wikimedia.org/r/322051

mpopov changed the title from "[Dashboard] Make backfilling data easier" to "[Dashboard] Migrate golden to Reportupdater infrastructure". Dec 5 2016, 8:20 PM
mpopov edited the task description.

Change 325870 had a related patch set uploaded (by Bearloga):
[WIP] Migrate to Reportupdater framework

https://gerrit.wikimedia.org/r/325870

mpopov edited the task description. Dec 15 2016, 8:59 PM

Running the test utility right now to check that everything is working OK before I officially ask Chelsy for CR.

mpopov edited the task description. Jan 9 2017, 7:08 PM
mpopov changed the point value for this task from 10 to 50.
mpopov added a subscriber: chelsyx.

Change 335575 had a related patch set uploaded (by Bearloga):
Point to new datasets

https://gerrit.wikimedia.org/r/335575

Change 335576 had a related patch set uploaded (by Bearloga):
Point to new datasets

https://gerrit.wikimedia.org/r/335576

Change 335579 had a related patch set uploaded (by Bearloga):
Point to new datasets and add LDF

https://gerrit.wikimedia.org/r/335579

mpopov edited the task description. Feb 2 2017, 10:23 PM

Change 335746 had a related patch set uploaded (by Bearloga):
Point to new datasets

https://gerrit.wikimedia.org/r/335746

Change 336350 had a related patch set uploaded (by Bearloga):
Point to new datasets & add new metrics

https://gerrit.wikimedia.org/r/336350

mpopov edited the task description. Feb 7 2017, 1:13 AM

Change 335575 merged by Bearloga:
Point to new datasets

https://gerrit.wikimedia.org/r/335575

Change 335579 merged by Bearloga:
Point to new datasets and add LDF

https://gerrit.wikimedia.org/r/335579

Change 335576 merged by Bearloga:
Point to new datasets

https://gerrit.wikimedia.org/r/335576

Change 335746 merged by Bearloga:
Point to new datasets

https://gerrit.wikimedia.org/r/335746

Change 336350 merged by Bearloga:
Point to new datasets & add new metrics

https://gerrit.wikimedia.org/r/336350

Change 341373 had a related patch set uploaded (by chelsyx):
[wikimedia/discovery/prince] Fixed bug in tab country_breakdown

https://gerrit.wikimedia.org/r/341373

Change 341373 merged by Bearloga:
[wikimedia/discovery/prince] Fixed bug in tab country_breakdown

https://gerrit.wikimedia.org/r/341373

Change 325870 merged by Bearloga:
[wikimedia/discovery/golden] Migrate to Reportupdater framework

https://gerrit.wikimedia.org/r/325870

Change 341746 had a related patch set uploaded (by Bearloga):
[wikimedia/discovery/twilightsparql] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341746

Change 341745 had a related patch set uploaded (by Bearloga):
[wikimedia/discovery/wetzel] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341745

Change 341744 had a related patch set uploaded (by Bearloga):
[wikimedia/discovery/wonderbolt] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341744

Change 341743 had a related patch set uploaded (by Bearloga):
[wikimedia/discovery/prince] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341743

Change 341742 had a related patch set uploaded (by Bearloga):
[wikimedia/discovery/rainbow] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341742

Change 341742 merged by Chelsyx:
[wikimedia/discovery/rainbow] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341742

Change 341743 merged by Chelsyx:
[wikimedia/discovery/prince] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341743

Change 341744 merged by Chelsyx:
[wikimedia/discovery/wonderbolt] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341744

Change 341745 merged by Chelsyx:
[wikimedia/discovery/wetzel] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341745

Change 341746 merged by Chelsyx:
[wikimedia/discovery/twilightsparql] Annotate Reportupdater migration on graphs

https://gerrit.wikimedia.org/r/341746

Change 341935 had a related patch set uploaded (by Bearloga):
[wikimedia/discovery/dashboard] Deploy post-Reportupdater dashboards

https://gerrit.wikimedia.org/r/341935

Change 341935 merged by Bearloga:
[wikimedia/discovery/dashboard] Deploy post-Reportupdater dashboards

https://gerrit.wikimedia.org/r/341935

mpopov edited the task description. Mar 8 2017, 11:49 PM

stat1002:/a/discovery/golden has been updated

All the dashboards have been deployed. They use the new datasets located in stat1002:/a/aggregate-datasets/discovery

The pre-migration datasets have been archived in /a/discovery/legacy-datasets/pre-reportupdater

I will send an announcement email to discovery-l, since there are several new metrics being tracked on the WDQS and Portal dashboards.

debt added a subscriber: debt. Mar 9 2017, 12:32 AM

Nice job! :)

debt closed this task as "Resolved".

...and wrapping this up! :) Good work!