
Implement production mechanism that emits (wiki_db, revision_id) pairs for missing or inaccurate rows
Open, In Progress, High, Public

Description

In T367570 (Spike: Figure feasibility to emit (wiki_db, revision_id) pairs), we built a proof of concept of a mechanism that emits (wiki_db, revision_id) pairs. In this task we should bring that code up to production quality.

Output:

  • A PySpark job that runs checks similar to those in T367570, but is properly parameterized. Done via T368754.
    • This job should have a sink_table parameter. Final table name: wmf_dumps.wikitext_inconsistent_rows
    • Think about and define the DDL of wmf_dumps.wikitext_inconsistent_rows so that it is also usable from the point of view of data quality metrics.
  • A separate job that reads from wmf_dumps.wikitext_inconsistent_rows and calls EventGate. Done via T368755.
  • An Airflow job that orchestrates all of this. Core of work done via T368756.
  • Figure out a performant way to read all data from the revision table via Spark (T372677).
  • Add a new hourly Spark MERGE INTO job that consumes the page_content_late_change Hive table (T368746).
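
As a rough illustration of the first output above, the sketch below shows one way a parameterized check could be shaped. It is plain Python rather than PySpark so it can run anywhere, and it is not the code from T367570 or T368754. The `--wiki_db` flag, the checksum comparison, and the `reason` labels are assumptions made for the example; only the `sink_table` parameter and its default table name come from this task.

```python
import argparse
from typing import Dict, List, NamedTuple, Tuple

# A row is identified by (wiki_db, revision_id), as in the task title.
Key = Tuple[str, int]


class InconsistentRow(NamedTuple):
    wiki_db: str
    revision_id: int
    reason: str  # hypothetical labels: "missing" or "mismatch"


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse job parameters; --wiki_db is an assumed parameter."""
    parser = argparse.ArgumentParser(description="Consistency check sketch")
    parser.add_argument("--wiki_db", required=True)
    parser.add_argument(
        "--sink_table",
        default="wmf_dumps.wikitext_inconsistent_rows",
        help="Table that receives the inconsistent (wiki_db, revision_id) pairs",
    )
    return parser.parse_args(argv)


def find_inconsistent_rows(
    source: Dict[Key, str], target: Dict[Key, str]
) -> List[InconsistentRow]:
    """Compare per-revision checksums from the source of truth against the target.

    A key absent from the target is "missing"; a key whose checksum differs
    is a "mismatch". In a Spark job this would more likely be a join between
    two DataFrames; dictionaries stand in for them here.
    """
    rows = []
    for (wiki_db, revision_id), checksum in sorted(source.items()):
        if (wiki_db, revision_id) not in target:
            rows.append(InconsistentRow(wiki_db, revision_id, "missing"))
        elif target[(wiki_db, revision_id)] != checksum:
            rows.append(InconsistentRow(wiki_db, revision_id, "mismatch"))
    return rows
```

Keeping a reason alongside each pair is one way the sink table could also serve data quality metrics, since counts per reason can be aggregated directly; whether the real DDL does this is not stated here.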

Related Objects

Status        Assigned
Open          xcollazo
Open          None
In Progress   xcollazo
Resolved      xcollazo
In Progress   xcollazo
Open          xcollazo
Open          None
Duplicate     None
Open          None
Resolved      xcollazo
In Progress   Milimetric
Duplicate     None
Resolved      xcollazo
Duplicate     None
Open          None
Open          gmodena

Event Timeline

xcollazo renamed this task from "Implement PySpark job that emits (wiki_db, revision_id) pairs for missing or inaccurate rows" to "Implement job that emits (wiki_db, revision_id) pairs for missing or inaccurate rows". Jun 28 2024, 4:10 PM
xcollazo created this task.
xcollazo renamed this task from "Implement job that emits (wiki_db, revision_id) pairs for missing or inaccurate rows" to "Implement production mechanism that emits (wiki_db, revision_id) pairs for missing or inaccurate rows". Jun 28 2024, 4:13 PM
xcollazo updated the task description.
xcollazo changed the task status from Open to In Progress. Jul 1 2024, 3:45 PM
xcollazo claimed this task.
xcollazo triaged this task as High priority.