[Update Pipeline] wikidata_coeditors
Open, Needs Triage | Public | 3 Estimated Story Points

Description

Dependencies

No downstream changes needed

  • HQL logic that needs to change
    • The HQL needs to include only logged-in users. This assumes mediawiki_history has an event_user_is_permanent column.
  • HQL table creation scripts that need to change: none
  • Deployment plan script
    • <<plan steps>>
  • Airflow DAG that schedules the HQL logic
    • main dag (change properties passed in to allow vetting in the parallel data pipeline)

Testing notes

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Use new coeditors_metrics hql file | repos/data-engineering/airflow-dags!899 | ebysans | tempacc_coeditors | main

Event Timeline

Milimetric set the point value for this task to 3. (Oct 16 2024, 3:59 PM)
Milimetric renamed this task from wikidata_coeditors to [Update Pipeline] wikidata_coeditors. (Oct 21 2024, 5:27 PM)

Change #1084282 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery@master] Add new coeditors_metrics hql file to support temp account changes.

https://gerrit.wikimedia.org/r/1084282

We talked some more about the testing strategy and decided to just keep the changes in the same file. The testing will be connected and involved anyway. Sorry for the confusion, I know we said to do it in parallel at the beginning. Also, feel free to still argue for that if you think it's a good idea.

Right now the WMDE folks are leaning towards stopping this job altogether. With the limitations of Prometheus they're looking to build a new coeditors metric on Airflow. For now we have to wait until December 9th when @Lydia_Pintscher can give the final ok to turn off this old version. If that happens, we can change this task to decommission.

Sorry. I overlooked this.
I have the last data I needed. So from my side it can go now.

@AndrewTavis_WMDE does Wikidata need any of these Hive2Graphite jobs anymore then? Can we turn them all off? Or are some of these still useful?

Hey @Milimetric 👋 Thanks for reaching out about this!

Our coeditor metric process is now 100% Airflow and Hive jobs via wd_coeditors_monthly_dag.py and the corresponding job queries in wmde/analytics/hql/airflow_jobs/wd_coeditors/. From that I'd say that wikidata_coeditors_metrics_to_graphite_monthly_dag.py is certainly unneeded.

The other instances of Hive2Graphite are in apis_metrics_to_graphite_hourly_dag.py and wikidata_metrics_to_graphite_daily_dag.py. I'm really not sure on apis_metrics_to_graphite_hourly_dag.py as this seems more general? I'm assuming you meant the Wikidata specific ones, but happy to dig into this one more :)

For wikidata_metrics_to_graphite_daily_dag.py we have various metrics for which we need to check that the Graphite to Prometheus migration has been completed:

  • ArticlePlaceholder
  • Reliability
  • specialentity_data
  • Special:EntitySchemaText
  • EntitySchema namespace

This is broadly being done within the T371616: [EPIC][GRAFMIGR] Spruce up Wikidata Grafana Metrics. I won't have time to finalize this today and am off tomorrow, but I'll plan on looking into the above early next week!

Hi @AndrewTavis_WMDE, I'm just following up on wikidata_metrics_to_graphite_daily_dag.py. As part of T372855, can we go ahead and turn this DAG off?

Hi @Snwachukwu 👋

In finalizing the overview of wikidata_metrics_to_graphite_daily_dag.py, the check for whether a prior Graphite to Prometheus migration has been done is that statsd does not appear in the associated PHP files (unless it's copyToStatsdAt) and that StatsFactory does appear in them. So generally, search for stats* in the related files and see what comes up. I'm also checking which HQL files are involved in the process and what the final Grafana boards are :)
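
A minimal sketch of that check, assuming the PHP source is available as a string. The function name and the pattern are hypothetical and only illustrate the heuristic described above; this is not an existing tool.

```python
import re

# Hypothetical helper for illustration only; not part of any Wikimedia repository.
STATSD_PATTERN = re.compile(r"statsd", re.IGNORECASE)
ALLOWED_STATSD = "copyToStatsdAt"  # the one statsd mention that is expected after migration


def migration_looks_done(php_source: str) -> bool:
    """Heuristic: StatsFactory is used and no statsd reference remains
    other than copyToStatsdAt."""
    # Drop the allowed call name so it is not counted as a leftover statsd reference.
    remaining = php_source.replace(ALLOWED_STATSD, "")
    has_legacy_statsd = STATSD_PATTERN.search(remaining) is not None
    has_stats_factory = "StatsFactory" in php_source
    return has_stats_factory and not has_legacy_statsd
```

This only flags candidates; a file that passes would still need a manual look, since the heuristic is a plain text search.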

Results on the above

Questions that I have

  • Does what I explained above for CounterMetric within the Extension:ArticlePlaceholder code indicate that the Graphite to Prometheus migration has happened?
  • @Lydia_Pintscher 👋 Some requests for confirmation for you :) Can you let us know if the data for the following Grafana boards is still needed?
    1. Wikidata Reliability Metrics
    2. Wikidata Special:EntityData
    3. EntitySchema
    4. I'm assuming Article Placeholder certainly is, and this one is different as we should be able to maintain it on Grafana (see next steps below)

Possible next steps

For Reliability metrics, specialentity_data metrics, Special:EntitySchemaText metrics and EntitySchema namespace metrics: If memory serves me we won't be able to port these HQL results into Prometheus if they are still needed. If we still do need the data then maybe we can do an interim solution where the metrics are just saved to the data lake? Would we want to extract their processes from the wikidata_metrics_to_graphite_daily_dag.py DAG into a DAG of their own?

  • This would allow the old wikidata_metrics_to_graphite_daily_dag.py DAG to be turned off and new DAGs to be run only for the needed data
  • We could maintain these DAGs within wmde on Airflow if that would be preferable

Please let me know on the above! Looks like there might be a need for some sub tasks here if we do want to keep some of the other metrics. Happy to help make those!

This is a question for the EMs. Pinging @WMDE-leszek and @karapayneWMDE. But my assumption is that yes, we do want to keep this.

  1. Wikidata Special:EntityData
  2. EntitySchema
  3. I'm assuming Article Placeholder certainly is, and this one is different as we should be able to maintain it on Grafana (see next steps below)

All 3 of them we want to keep. The 3rd one is the least important of them.

Thank you, @Lydia_Pintscher! Quick notes for those just subscribed:

  • The Wikidata Reliability Metrics, Wikidata Special:EntityData and EntitySchema will likely need to be dropped from Grafana for an interim period of time as Hive to Prometheus doesn't function as Hive to Graphite did (speaking from past tasks where we have made similar decisions).
  • The Article Placeholder data should be able to be maintained on Grafana, but I'm waiting to hear back on questions of whether the migration has been completed or not
    • If not, then this would be added into the work in T371616

Hi again, @Snwachukwu 👋

Clearing up some of the above based on discussions I've been having. Firstly, whether the ArticlePlaceholder metrics have been converted to Prometheus tracking is not important for this task, as there is also an HQL script for these specific metrics: wikidata_articleplaceholder_metrics.hql. Sorry for the confusion. I'm bouncing a bit between data engineering and MediaWiki development for the Graphite deprecation and my head went to MediaWiki at first :)

We need to do a conversion of all metrics in analytics/refinery/wikidata before wikidata_metrics_to_graphite_daily_dag.py can be turned off. These scripts currently create temporary views and then export directly to Graphite. The general plan for this would be:

  • Iceberg tables within the wmde namespace would be created for each of these processes
  • I'd bring the analytics/refinery/wikidata scripts over to gitlab:repos/wmde/analytics and edit them to insert into the new tables
  • Each of the processes would then have their own DAG deployed to the wmde Airflow instance
  • At this point wikidata_metrics_to_graphite_daily_dag.py could be turned off as we'd have processes to collect the new data
  • In a separate task I would migrate the old Graphite data into the respective Iceberg tables

I'd suggest that we set a deadline for the new DAGs being deployed to the end of March. @Snwachukwu, @Milimetric: Would this timeline work for you all?

CC @Ifrahkhanyaree_WMDE as we need this to be prioritized to not block the deprecation of Graphite. Happy to make the tasks and add them to incoming on the Kanban! 😊

@AndrewTavis_WMDE The plan is for Graphite to go read-only by the end of Q3-FY24/25. We can hold off on turning off wikidata_metrics_to_graphite_daily_dag.py until towards the end of this quarter.

GitLab:data-engineering/airflow-dags#1185 adds five new DAGs for the following processes within wikidata_metrics_to_graphite_daily_dag that will populate tables in the Data Lake:

Tests have passed except for the approval job. The plan is to merge and auto-deploy tomorrow and then run the metrics from the 1st of January on. We'll then be good to disable wikidata_metrics_to_graphite_daily_dag by the March 28th deadline.

GitLab:data-engineering/airflow-dags#1185 is merged, all five DAGs are deployed and we're all green for daily job success from January 1st 2025 on 💚 We're thus ready to turn off wikidata_metrics_to_graphite_daily_dag :)