For the purpose of this POC, a minimal schema could be: (wiki_db string, page_id bigint, revision_id bigint, revision_deleted_parts array<string>). It might also be interesting to include an is_latest boolean, both to keep track of which revision is the latest for a page and to see how fast updates to that flag are in Iceberg at our volume.
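Expressed as Spark SQL DDL, that minimal schema might look like the sketch below. This is illustrative only: the table name is a placeholder, and it assumes a Spark session already configured with the Iceberg runtime and an Iceberg-capable catalog.

```python
from pyspark.sql import SparkSession

# Assumes Spark is configured with the Iceberg runtime and a catalog that
# supports Iceberg tables; both are deployment-specific.
spark = SparkSession.builder.appName("iceberg-dumps-poc").getOrCreate()

# The table name is a placeholder for wherever the POC database ends up living.
spark.sql("""
    CREATE TABLE IF NOT EXISTS milimetric.iceberg_minimal_poc (
        wiki_db                STRING,
        page_id                BIGINT,
        revision_id            BIGINT,
        revision_deleted_parts ARRAY<STRING>,
        is_latest              BOOLEAN
    )
    USING iceberg
""")
```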
- study https://iceberg.apache.org/docs/latest/configuration/#write-properties (a sketch of setting these per table follows the list)
- experiment with writing sample content from wmf.mediawiki_wikitext_history to an Iceberg table backed by Parquet files (see the sample-write sketch after this list). Here we have to optimize for:
  - as few Parquet files as possible
  - fast joins with the metadata table (see T323642: Spark Streaming Dumps POC: Backfill metadata table)
  - fast updates from Kafka streams of new revisions, page changes, and visibility changes
- using the results from above, write milimetric.iceberg_wikitext_history with everything available in wmf.mediawiki_wikitext_history (see the backfill sketch after this list). For now this table will mostly be used for performance testing, but the schema is the same.
- test inserting into milimetric.iceberg_wikitext_history from Spark streaming (see the streaming sketch after this list)
- document everything
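For the write-properties item, properties are set per table. The property names below come from the linked Iceberg docs; the values are only illustrative starting points for the experiment (536870912 bytes is 512 MiB, the Iceberg default target file size).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-write-props").getOrCreate()

# Tune Iceberg write behavior per table; values are starting points to
# experiment with, not recommendations.
spark.sql("""
    ALTER TABLE milimetric.iceberg_minimal_poc SET TBLPROPERTIES (
        'write.format.default'            = 'parquet',
        'write.parquet.compression-codec' = 'zstd',
        'write.target-file-size-bytes'    = '536870912',
        'write.distribution-mode'         = 'hash'
    )
""")
```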
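For the sample-write experiment, a minimal sketch: coalescing keeps the Parquet file count down, and sorting within partitions on the join keys should let Iceberg's file-level min/max stats prune files when joining against the metadata table. The snapshot value, wiki filter, and partition count are arbitrary, and the two fields that are not straight copies from the source are stubbed out.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iceberg-sample-write").getOrCreate()

# Pull a small slice of one wiki; the snapshot value and wiki are arbitrary.
sample = (
    spark.table("wmf.mediawiki_wikitext_history")
    .where(F.col("snapshot") == "2022-10")
    .where(F.col("wiki_db") == "simplewiki")
    .select(
        "wiki_db",
        "page_id",
        "revision_id",
        # Deriving these two fields is part of the experiment itself, so the
        # sketch just stubs them out.
        F.lit(None).cast("array<string>").alias("revision_deleted_parts"),
        F.lit(None).cast("boolean").alias("is_latest"),
    )
)

# coalesce() keeps the Parquet file count low; sorting within partitions on
# the join keys lets file-level min/max stats do the pruning.
(
    sample
    .coalesce(4)
    .sortWithinPartitions("wiki_db", "page_id")
    .writeTo("milimetric.iceberg_minimal_poc")
    .append()
)
```

Iceberg's Spark SQL extensions also allow declaring a table-level sort order (ALTER TABLE ... WRITE ORDERED BY ...), which would be worth comparing against the explicit sortWithinPartitions above.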
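For the full milimetric.iceberg_wikitext_history table, a CTAS keeps the schema identical to the source by construction. Partitioning by wiki_db and the snapshot value are assumptions to validate, not recommendations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-wikitext-backfill").getOrCreate()

# Copy one full snapshot; '2022-10' is a placeholder. CTAS keeps the Iceberg
# table's schema identical to the source table's.
spark.sql("""
    CREATE TABLE IF NOT EXISTS milimetric.iceberg_wikitext_history
    USING iceberg
    PARTITIONED BY (wiki_db)
    AS SELECT * FROM wmf.mediawiki_wikitext_history
    WHERE snapshot = '2022-10'
""")
```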
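For the streaming test, Iceberg supports Spark Structured Streaming append sinks. The sketch below writes to the minimal POC table so the column mapping stays short; the same pattern would apply to milimetric.iceberg_wikitext_history once all fields are mapped. The broker, topic, event field names, and checkpoint path are all placeholders/assumptions about the revision-create stream.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("iceberg-streaming-insert").getOrCreate()

# Assumed shape of the revision-create event payload; these field names are a
# guess at the relevant subset, not the full event schema.
event_schema = StructType([
    StructField("database", StringType()),
    StructField("page_id", LongType()),
    StructField("rev_id", LongType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092")  # placeholder broker
    .option("subscribe", "eqiad.mediawiki.revision-create")  # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select(
        F.col("e.database").alias("wiki_db"),
        F.col("e.page_id").alias("page_id"),
        F.col("e.rev_id").alias("revision_id"),
        F.lit(None).cast("array<string>").alias("revision_deleted_parts"),
        F.lit(True).alias("is_latest"),  # a brand-new revision starts out as latest
    )
)

# Streaming append into the Iceberg table; the checkpoint path is a
# placeholder and would need to be durable in a real run.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "/tmp/checkpoints/iceberg_wikitext_poc")
    .toTable("milimetric.iceberg_minimal_poc")
)
query.awaitTermination()
```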