
Stop Refining mediawiki_job events in Hive
Closed, ResolvedPublic


We currently import all mediawiki.job.* streams into HDFS via Camus, and then Refine a limited set of them into the Hive event database. Many of these streams can't be refined because they don't have a real schema. Those that are refined rely on schema inference rather than a fully typed JSONSchema.
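Schema inference fails when the same field shows up with different types across events. The sketch below (hypothetical field names, plain Python rather than the actual Spark-based Refine job) illustrates the kind of type conflict that makes inference error-prone:

```python
import json

# Two hypothetical mediawiki.job events whose payloads disagree on a
# field's type -- the kind of input that trips up schema inference.
events = [
    '{"type": "refreshLinks", "params": {"namespace": 0}}',
    '{"type": "refreshLinks", "params": {"namespace": "Talk"}}',
]

def infer_field_types(raw_events):
    """Collect the set of value types seen for each params field."""
    seen = {}
    for raw in raw_events:
        event = json.loads(raw)
        for field, value in event.get("params", {}).items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return seen

types = infer_field_types(events)
# A field mapping to more than one type means no single Hive column
# type fits, so a refine step must either fail or guess.
conflicts = {f: t for f, t in types.items() if len(t) > 1}
print(conflicts)
```

With a fully typed JSONSchema, this conflict would be caught at validation time instead of surfacing as a refine failure.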

As far as I know, the job queue events have never been used in Hive. Refining them is error-prone and sometimes causes confusion for Data Engineers during their ops week duty.

We can keep importing the raw JSON events via Camus. If we ever need them to troubleshoot a job queue issue, we can still use them, just not as easily as we could if they were refined.
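Troubleshooting against the raw import would mean reading the newline-delimited JSON directly instead of querying a Hive table. A minimal sketch, assuming hypothetical event contents (the actual Camus output layout and field names may differ):

```python
import json

# Hypothetical raw Camus output: one JSON job event per line.
raw_lines = """\
{"type": "refreshLinks", "meta": {"domain": "en.wikipedia.org"}}
{"type": "cirrusSearchLinksUpdate", "meta": {"domain": "de.wikipedia.org"}}
{"type": "refreshLinks", "meta": {"domain": "de.wikipedia.org"}}
""".splitlines()

def jobs_of_type(lines, job_type):
    """Parse each line as JSON and keep events matching job_type."""
    return [e for e in (json.loads(l) for l in lines) if e["type"] == job_type]

matches = jobs_of_type(raw_lines, "refreshLinks")
print(len(matches))  # number of refreshLinks events found
```

This is the "not as easily" trade-off: ad-hoc parsing instead of a typed table, but the data is still there when needed.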

Event Timeline

fdans added a project: Analytics-Kanban.

If there are no objections, we will stop refining these and remove them from the event database during the week of May 10.

Change 689866 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine - Ensure mediawiki_job refine is absent

Change 689866 merged by Ottomata:

[operations/puppet@production] refine - Ensure mediawiki_job refine is absent

Mentioned in SAL (#wikimedia-analytics) [2021-05-12T13:56:17Z] <ottomata> removing refine_mediawiki_job Refine jobs - T281605

Change 689870 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine - Remove absented refine_mediawiki_job

Change 689870 merged by Ottomata:

[operations/puppet@production] refine - Remove absented refine_mediawiki_job

Should we delete all mediawiki_job tables and data now? I think so.

The description says we're keeping the raw JSON import, just not the rest of the pipeline. I agree with deleting the refined data; unused data is just confusing. Just making sure everyone expects the same thing.
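If we go ahead with deletion, the cleanup amounts to dropping each refined table and removing its backing data directory. A sketch that generates the statements and paths (table names and the HDFS base path here are illustrative assumptions, not the actual inventory):

```python
# Hypothetical list of refined mediawiki_job tables to clean up.
tables = [
    "mediawiki_job_refreshlinks",
    "mediawiki_job_cirrussearchlinksupdate",
]

# Assumed layout: Hive tables in the `event` database, data under a
# per-table HDFS directory. Real cleanup would run the DDL via a Hive
# client and the deletes via `hdfs dfs -rm -r`.
drop_statements = [f"DROP TABLE IF EXISTS event.{t};" for t in tables]
hdfs_paths = [f"/wmf/data/event/{t}" for t in tables]

for stmt, path in zip(drop_statements, hdfs_paths):
    print(stmt)
    print(f"hdfs dfs -rm -r {path}")
```

Generating the commands first makes the cleanup reviewable before anything destructive runs.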