For the purpose of this POC, the minimal schema can be: (wiki_db string, page_id bigint, revision_id bigint, revision_text string)
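A minimal sketch of this schema as Iceberg DDL (the database/table names and the partition spec are assumptions for illustration, not decisions):

```sql
-- Sketch only: names and the partition spec are placeholders.
CREATE TABLE some_test_db.ice_wikitext (
  wiki_db       STRING,
  page_id       BIGINT,
  revision_id   BIGINT,
  revision_text STRING
)
USING iceberg
-- one option for fast (wiki_db, page_id) lookups; the bucket count needs tuning:
PARTITIONED BY (wiki_db, bucket(64, page_id));
```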
- study https://iceberg.apache.org/docs/latest/configuration/#write-properties
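Among the write properties documented there, these look most relevant to this POC (the values are starting points to experiment with, not recommendations):

```sql
-- Sketch: ALTER an existing table; these could also go in CREATE TABLE ... TBLPROPERTIES.
ALTER TABLE some_test_db.ice_wikitext SET TBLPROPERTIES (
  'write.format.default'         = 'avro',       -- back data files with Avro instead of Parquet
  'write.target-file-size-bytes' = '1073741824', -- 1 GiB target => fewer, larger files
  'write.avro.compression-codec' = 'gzip'        -- the default; worth benchmarking alternatives
);
```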
- experiment with writing sample content from wmf.mediawiki_wikitext_history to an Iceberg table backed by Avro files. The Iceberg table will eventually be configured to optimize for:
- as few Avro files as possible
- fast joins with the metadata table (see T323642: Spark Streaming Dumps POC: Backfill metadata table)
- fast lookup of content for a given (wiki_db, page_id)
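To reason about the "as few Avro files as possible" goal, a rough file-count estimate from the table size and the `write.target-file-size-bytes` setting can guide tuning. `estimate_file_count` below is a hypothetical helper, and it assumes the writer packs files up to the target size:

```python
import math

def estimate_file_count(total_bytes: int,
                        target_file_size_bytes: int = 512 * 1024 * 1024) -> int:
    """Rough lower bound on the number of Iceberg data files, assuming
    writes are packed up to write.target-file-size-bytes (512 MiB default)."""
    return max(1, math.ceil(total_bytes / target_file_size_bytes))

# e.g. ~5 TiB of wikitext at the default 512 MiB target:
print(estimate_file_count(5 * 1024**4))  # 10240 files
```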
- with the results from above, populate <some_test_db>.ice_wikitext with everything available in wmf.mediawiki_wikitext_history. This table will mostly be used for performance testing for now, so it doesn't need more than the minimal schema above.
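For that performance testing, the main query pattern to benchmark is presumably the point lookup; a sketch, where the table name and literals are arbitrary:

```sql
-- Sketch of the lookup to benchmark; with a partition spec like
-- (wiki_db, bucket(N, page_id)) this should prune down to a handful of files.
SELECT revision_id, revision_text
FROM some_test_db.ice_wikitext
WHERE wiki_db = 'enwiki'
  AND page_id = 12345;
```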
- run the job to populate a semi-final table in the wmf database
- document the job parameters and explain in detail how the Spark job was tuned
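As a placeholder for that documentation, these are the kinds of parameters worth recording (the launcher, the script name, and every value below are assumptions to be replaced by the actual tuned configuration):

```shell
# Sketch only: hypothetical job script and guessed values.
spark3-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 8G \
  --executor-cores 4 \
  --conf spark.dynamicAllocation.maxExecutors=64 \
  --conf spark.sql.shuffle.partitions=1024 \
  ice_wikitext_backfill.py
```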
NOTE: while investigating this, one interesting question is: should we maintain another table with (wiki_db, page_id, cached_xml_dumps_output) that is updated only when a page sees new revisions?
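If such a cache table existed, Iceberg's row-level MERGE INTO support in Spark SQL would be a natural way to refresh only the changed pages. A sketch, where the target table and the source view are hypothetical:

```sql
-- Sketch: refresh cached output only for pages that saw new revisions.
MERGE INTO some_test_db.ice_wikitext_cache t
USING updated_pages s  -- hypothetical view of (wiki_db, page_id, cached_xml_dumps_output)
ON  t.wiki_db = s.wiki_db
AND t.page_id = s.page_id
WHEN MATCHED THEN UPDATE SET t.cached_xml_dumps_output = s.cached_xml_dumps_output
WHEN NOT MATCHED THEN INSERT *;
```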