Implement production mechanism that emits (wiki_db, revision_id) pairs for missing or inaccurate rows
Open, In Progress, High
Public
Assigned To

Authored By

	xcollazo
	Jun 28 2024, 4:10 PM

Description

In T367570: Spike: Figure feasability to emit (wiki_db, revision_id) pairs, we built a proof of concept of a mechanism to emit (wiki_db, revision_id) pairs. In this task we should bring that code up to production quality.

Output:

A PySpark job that can run checks similar to those in T367570, but properly parameterized. Done via T368754.
- This job should have a sink_table parameter. Final table name: wmf_dumps.wikitext_inconsistent_rows
- Think about and define the DDL of wmf_dumps.wikitext_inconsistent_rows so that it is also usable for data quality metrics.
A separate job that reads from wmf_dumps.wikitext_inconsistent_rows and calls EventGate. Done via T368755.
An Airflow job that orchestrates all of this. Core of the work done via T368756.
Figure out a performant way to read all data from the revision table via Spark (T372677).
Add a new hourly Spark MERGE INTO job that consumes the page_content_late_change Hive table (T368746).
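
As a rough illustration of the first output item, the consistency check can be sketched in plain Python. This is not the production PySpark job from T368754: the `Row` type, the `sha1` comparison field, the `reason` labels, and the `find_inconsistent` and `parse_args` helpers are all hypothetical names chosen for this sketch. Only the `sink_table` parameter and its default table name come from the description above.

```python
import argparse
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    """Hypothetical minimal row shape; the real tables have more columns."""
    wiki_db: str
    revision_id: int
    sha1: Optional[str]  # assumed comparison field, for illustration only


Key = Tuple[str, int]


def find_inconsistent(
    source: Iterable[Row], target: Iterable[Row]
) -> List[Tuple[str, int, str]]:
    """Return sorted (wiki_db, revision_id, reason) tuples for source rows
    that are missing from the target or whose sha1 differs there."""
    target_by_key: Dict[Key, Row] = {
        (r.wiki_db, r.revision_id): r for r in target
    }
    inconsistent = []
    for row in source:
        key = (row.wiki_db, row.revision_id)
        match = target_by_key.get(key)
        if match is None:
            inconsistent.append((*key, "missing"))
        elif match.sha1 != row.sha1:
            inconsistent.append((*key, "mismatch"))
    return sorted(inconsistent)


def parse_args(argv=None) -> argparse.Namespace:
    """Parse job parameters; only sink_table is named in the task."""
    parser = argparse.ArgumentParser(description="Consistency check sketch")
    parser.add_argument(
        "--sink_table",
        default="wmf_dumps.wikitext_inconsistent_rows",
        help="Table that receives the inconsistent (wiki_db, revision_id) pairs",
    )
    return parser.parse_args(argv)
```

In a Spark setting the same comparison would more likely be expressed as a join between the source-of-truth revision data and wmf_dumps.wikitext_raw, with the result written to the table named by `sink_table`. The dictionary lookup here only stands in for that join.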

Related Objects

Status	Assigned	Task
Open	xcollazo	T358877 Dumps 2.0 Phase II: Production intermediate table milestone
Open	None	T358373 [Dumps 2] Reconciliation mechanism to detect and fetch missing/mismatched revisions
In Progress	xcollazo	T368753 Implement production mechanism that emits (wiki_db, revision_id) pairs for missing or inaccurate rows
Resolved	xcollazo	T368754 Production PySpark job that can run consistency checks for wmf_dumps.wikitext_raw
In Progress	xcollazo	T368755 Python job that reads from wmf_dumps.wikitext_inconsistent_row and produced reconciliation events.
Open	xcollazo	T378122 Table maintenance for wmf_dumps.wikitext_inconsistent_row is failing
Open	None	T379676 Add relevant kafka clusters to defined airflow connections in puppet
Duplicate	None	T379968 noc.wikimedia.org is slow and it times out sporadically
Open	None	T380142 Reimaging a kubernetes control-plane invalidates service-account tokens issued by it
Resolved	xcollazo	T368756 Airflow job to orchestrate the dumps reconcilliation emission mechanism
In Progress	Milimetric	T372677 Figure a performant way to read all data from revision table via Spark
Duplicate	None	T378603 Some wikis have revision rows where rev_timestamp is blank
Resolved	xcollazo	T369868 Improve handling of delete, restore, and merge from incremental update
Duplicate	None	T375077 Add a new hourly Spark MERGE INTO job that consumes the page_content_late_change hive table.
Open	None	T377852 Tune Reconciliation mechanism to do historic runs (all revisions, all wikis)
Open	gmodena	T368746 Implement a new MERGE INTO job that consumes the new reconciliation stream into wmf_dumps.wikitext_raw

Event Timeline

xcollazo renamed this task from Implement PySpark job that emits (wiki_db, revision_id) pairs for missing or inaccurate rows to Implement job that emits (wiki_db, revision_id) pairs for missing or inaccurate rows. Jun 28 2024, 4:10 PM

xcollazo created this task.

xcollazo renamed this task from Implement job that emits (wiki_db, revision_id) pairs for missing or inaccurate rows to Implement production mechanism that emits (wiki_db, revision_id) pairs for missing or inaccurate rows. Jun 28 2024, 4:13 PM

xcollazo updated the task description.

xcollazo updated the task description. Jun 28 2024, 4:16 PM

xcollazo mentioned this in T368756: Airflow job to orchestrate the dumps reconcilliation emission mechanism.

xcollazo mentioned this in T368754: Production PySpark job that can run consistency checks for wmf_dumps.wikitext_raw.

xcollazo mentioned this in T368755: Python job that reads from wmf_dumps.wikitext_inconsistent_row and produced reconciliation events..

xcollazo mentioned this in T368746: Implement a new MERGE INTO job that consumes the new reconciliation stream into wmf_dumps.wikitext_raw. Jun 28 2024, 7:27 PM

xcollazo changed the task status from Open to In Progress. Jul 1 2024, 3:45 PM

xcollazo claimed this task.

xcollazo triaged this task as High priority.

xcollazo changed the status of subtask T368754: Production PySpark job that can run consistency checks for wmf_dumps.wikitext_raw from Open to In Progress.

lbowmaker moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board. Jul 22 2024, 1:26 PM

xcollazo moved this task from In Process to Sprint Goals on the Dumps 2.0 (Kanban Board) board. Jul 24 2024, 2:51 PM