
Improve the delivery of the movement metrics (SDS 2.6.2)
Open, High, Public

Description

If we

  • implement automations for Movement Insights–owned reporting processes,
  • resolve challenges with upstream data availability, and
  • implement operational changes to insights generation processes,

we will be able to deliver regular insights to decision makers throughout the Foundation more quickly and more reliably.

This is hypothesis SDS 2.6.2 in the Wikimedia Foundation's FY2023-24 annual plan.

Related Objects

Status     Assigned
Open       nshahquinn-wmf
Resolved   JAllemandou
Resolved   nshahquinn-wmf
Resolved   nshahquinn-wmf
Open       nshahquinn-wmf
Invalid    nshahquinn-wmf
Open       nshahquinn-wmf
Declined   None
Open       nshahquinn-wmf
Resolved   nshahquinn-wmf
Resolved   nshahquinn-wmf
Resolved   nshahquinn-wmf
Resolved   brennen
Resolved   nshahquinn-wmf
Open       nshahquinn-wmf
Open       nshahquinn-wmf
Open       nshahquinn-wmf
Open       None
Resolved   nshahquinn-wmf
Resolved   nshahquinn-wmf
Resolved   Hghani
Open       Hghani
Resolved   nshahquinn-wmf
Resolved   Hghani
Open       None

Event Timeline

I just posted this update on Asana:

Learning the whole workflow of developing an SQL-based ETL Airflow job took a lot longer than anticipated. We are also taking the opportunity to revise the content of the tables, which has further slowed the process. However, we've completed the first job (barring issues raised in code review) and should be able to write the remaining jobs to update our intermediate tables within the next month.
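
As a rough illustration, a minimal SQL-based Airflow job might look like the sketch below. It uses the stock Apache Airflow SparkSqlOperator rather than our actual in-house DAG factories, and the DAG ID, schedule, SQL file path, and connection ID are all placeholders, not the real configuration:

```
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator

# Hypothetical monthly job that rebuilds an intermediate table from a templated
# SQL file. All names (DAG ID, SQL path, connection ID) are placeholders.
with DAG(
    dag_id="editor_month_monthly",
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(hours=1)},
) as dag:
    rebuild_editor_month = SparkSqlOperator(
        task_id="rebuild_editor_month",
        # Path to a .sql file; Airflow renders it as a Jinja template, so the
        # query can use macros like {{ ds }} to pick the month being rebuilt.
        sql="sql/editor_month_insert.sql",
        conn_id="spark_sql_default",
        name="editor_month_monthly",
    )
```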

However, our original scope of work also calls for turning the manually run notebooks that compute the metrics and create the visualizations for the metrics report itself into an Airflow job. This would be a Python-based job, which is more complex than an SQL-based job, and it would be doing a task that no existing job does. It is unlikely we will be able to accomplish this before the end of the fiscal year.
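
For contrast, a minimal Python-based job might look like the sketch below. The build_report callable is a placeholder standing in for the metric and chart code that currently lives in notebooks, not our actual code:

```
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_report(ds: str, **_) -> None:
    """Placeholder for the metric and chart code that currently lives in
    notebooks; a real job would import and call that code here."""
    print(f"Computing movement metrics and rendering charts for {ds}")


with DAG(
    dag_id="movement_metrics_python_job",
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",
    catchup=False,
) as dag:
    PythonOperator(task_id="build_report", python_callable=build_report)
```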

Our current plan is to consider this hypothesis finished with the data pipeline improvements that have been made and the migration of all the intermediate tables, and to treat the migration of the reporting notebooks as a separate piece of work in the coming fiscal year.

I just completed the corresponding project in Asana with this report:

  • Final status
    • Proven. We improved the reporting timeline and made significant process improvements, although we only accomplished about half of our original plan.
  • What was accomplished? Include metrics if possible.
    • We sped up the critical path for movement metrics (the Dumps 1.0 → Mediawiki Wikitext History → Knowledge Gaps data pipeline) by an average of 9 days, from 26 days before the change to 17 days afterward. However, the pipeline continues to be slow and erratic due to inherent limitations and one-off problems in the Dumps 1.0 process. More details: T365387.
    • We built an Airflow job to generate an improved version of the intermediate editor_month dataset. As a result, the Movement Insights team now has a strong understanding of Airflow and will be able to build future jobs much more easily. This part took much longer than expected due to:
      • the complexity of Airflow and the customization we have built around it (compounded by incomplete documentation)
      • process issues in creating "stable/reusable" tables as a non–Data Platform Engineering (DPE) team (e.g. T367243)
      • investigating discrepancies between the old and new versions of editor_month
    • We made the reporting process easier and more reliable through:
  • Major lessons
    • Our Airflow setup is very complicated! The documentation should be improved and teams new to Airflow should budget lots of time to learn the system.
    • It would be easier to use the Data Platform if DPE proactively made and documented clear decisions about how other teams should use the resources it maintains (for example, should they use a new or an existing Airflow instance? How should "stable/reusable" tables be organized in the Data Lake and HDFS? Should DPE be copied on alert emails?).
    • Our Airflow setup does not support jobs where the output is a report (rather than another dataset). This will limit its adoption. An obvious solution would be for Airflow to be able to run Jupyter notebooks (a sketch of this appears after this list).
    • As is somewhat widely understood, the edit history contained in the MediaWiki databases is not immutable, which means the same is true of datasets and metrics derived from it. This is almost entirely due to revision deletion and revision importing.
  • Next steps
    • We will finish migrating our Systemd-Puppet ETL jobs to Airflow as essential work, since migrating to Airflow as the single scheduler for all data engineering is an important project for reducing tech debt.
    • We will finish two higher-priority improvements to the calculation and visualization codebase (T361329, T368218) as essential work.
    • We will put together a list of suggested improvements to Airflow documentation for DPE.
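
As a rough sketch of the notebook-as-report idea mentioned under "Major lessons", the stock Apache Airflow PapermillOperator could run a parameterized notebook on a schedule. The DAG ID, paths, and the snapshot parameter below are placeholders, not a description of our actual setup:

```
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

# Hypothetical report job: execute a parameterized Jupyter notebook each month
# and save the rendered output. All names and paths are placeholders.
with DAG(
    dag_id="movement_metrics_report",
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",
    catchup=False,
) as dag:
    render_report = PapermillOperator(
        task_id="render_report",
        input_nb="notebooks/movement_metrics.ipynb",
        # {{ ds }} is Airflow's execution-date macro, so each run writes a
        # dated copy of the rendered report notebook.
        output_nb="reports/movement_metrics_{{ ds }}.ipynb",
        parameters={"snapshot": "{{ ds }}"},
    )
```

Papermill injects the parameters into a tagged cell and executes the notebook top to bottom, so the existing metric and visualization code could stay in notebook form while the scheduler owns the runs.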

I will keep this task open for the remaining work described in "next steps".