Page MenuHomePhabricator

Transition Campaigns-Product's monthly metrics reporting from PA to CP
Closed, ResolvedPublic

Description

Product Analytics (PA) has been outputting campaigns monthly metrics on a monthly basis, on the first Friday of the month and outputting metrics into the metrics sheet. @Iflorez is now to transition the running queries for monthly reporting to the Campaigns Product (CP) team. Product Analytics can continue to consult on the data and interpretation as needed.

Plan
  • Inform the team of the change
  • Discuss timing
  • Review options:
    • Run current JupyterLab notebook in Jupyter Lab
    • Run a version of the notebook as a python executable file
    • Since Jan 24 x1 cluster wikishared dBs are accessible on Superset's SQL lab -- run queries on SQL Lab and output to dashboard OR run all in one go and download data to csv to import the new row of monthly data into the existing sheet.
    • crontab?
    • ?
  • Until we have a set transition date where Claudio is ready to begin outputting metrics, Irene will continue to output the metrics.

Acceptance Criteria

  • meet &greet for @cmelo to understand the current process and @Iflorez to understand how Claudio might be running/setting up on his end
  • identify Superset dashboarding with virtual dataset - potential concerns/risks and share with the team
  • review with KC
  • ask about superset dashboarding on the Slack data-engineering channel
  • update queries per Slack discussion; confer with @ifried
  • get all the artifacts in place (queries, notebooks etc)
  • T374500: queries & HQL files on Gitlab
  • group meet to review options
  • engineers do a dry run to understand the process
  • discussions/meetings where helpful.
  • [Potentially] spikes on the campaigns engineering side / devoted time to automate this more.

Event Timeline

Background

A monthly first Friday run to sheets-output was the interim solution since T365404 was closed.

Research & Notes

Superset’s SQL Lab works for writing, running, and exploring SQL queries and saving them as virtual datasets, but it isn’t designed for full, multi-step ETL/orchestration. A virtual dataset in Superset is essentially a saved SQL query plus some metadata (metrics, calculated-columns, default filters, cache settings, etc.). Every time you open a chart or dashboard that uses it, Superset will fire off that SQL (or pull from its cache) against the warehouse. Virtual datasets run on demand, so we might notice long load times whenever the underlying query is heavy or the cache has expired.
Downsides: There's no consistent performance or built-in observability. Load time can increase as the queries increase in complexity and as queries are added to a dashboard (when on a dashboard).
A materialized table built by a pipeline (orchestrated by Airflow) will usually return much faster, since it’s just “SELECT * FROM my_prebuilt_table” rather than a complex join/aggregation every time. And it has retry logic and history to observe any issues encountered.

Queries currently pull from:
campaign_events CORE METRIC
ce_participants CORE METRIC
ce_organizers CORE METRIC
table for gauging newcomer reg dates
ce_invitation_lists (nice to have?)

Iflorez updated the task description. (Show Details)

Potential issues with running on Superset:

  • Not all of the metrics can be calculated on Superset. The team will still need to grab the count of new-campaign-editors using another tool.
  • Maintenance and data issues will likely apply at some point (see the highlighted chunk in the doc). Other issues may apply as well.

Quick updates:

  • Data engineering notes—Heads-up: keep queries under 60 seconds as Superset will time out aggressively thereafter AND keep in mind that this will be an internal-access dashboard.
  • @cmelo created a Superset dashboard in T399006 which has 7 charts (Superset virtual datasets) with last_x_months views.
  • new-campaign-event-participants which is not reported on at a present and which is used for general team tracking, will be output by @cmelo on a monthly basis.

@cmelo will review the latest update before conferring with the team. We'll sync thereafter and confirm timeline.

This work was completed in the first half of Q1. The dashboard is now available for on demand view.

@cmelo is now outputting New Accounts Created Per Month counts every month and adding this data to the dashboard.

Iflorez updated the task description. (Show Details)