Maniphest T324108

[SPIKE] Use Flink for batch backfilling
Open, Needs Triage, Public
Assigned To

None

Authored By

	• lbowmaker
	Nov 30 2022, 2:22 PM

Referenced Files

None

Description

User Story

As a platform engineer, I need to experiment with developing a Flink batch job, ideally using the same or similar code as a realtime streaming job. (TBD: can this be done in PyFlink?)

Why?

This will help us understand whether we can use a bounded Flink job for backfilling datasets. It will also tell us whether this approach would be easy enough for others who want to analyze larger datasets.

Done is:

Job is set to consume the page change stream with start and end bounds (for some small, arbitrary timeframe, e.g. the last 2 days)
Job returns a count of all events in that bounded timeframe
Job ends when all events are consumed
Short demo video of the job running
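
The first three criteria above could be sketched in PyFlink roughly as follows. This is a minimal sketch under stated assumptions, not the job this task would produce: the broker address, topic name, group id, and job name are hypothetical placeholders (the task names none of them), events are assumed to arrive from Kafka as plain strings, and the Flink Kafka connector jar is assumed to be available to the job. It relies on `KafkaSource` accepting timestamp-based starting and stopping offsets, with `set_bounded` making the source finite so that the job ends once everything has been read.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical placeholders: the task names neither the brokers nor the topic.
KAFKA_BROKERS = "localhost:9092"
PAGE_CHANGE_TOPIC = "page_change_topic_placeholder"


def bounded_window_ms(end, days=2):
    """Return (start_ms, end_ms) epoch-millisecond bounds for the `days` before `end`.

    `end` should be a timezone-aware datetime.
    """
    start = end - timedelta(days=days)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


def run_bounded_count(start_ms, end_ms):
    """Count all events between the two timestamps, then let the job finish."""
    # Imported lazily so the helper above works without PyFlink installed.
    from pyflink.common import Types, WatermarkStrategy
    from pyflink.common.serialization import SimpleStringSchema
    from pyflink.datastream import RuntimeExecutionMode, StreamExecutionEnvironment
    from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource

    env = StreamExecutionEnvironment.get_execution_environment()
    # Batch execution mode: the job runs over bounded input and then terminates.
    env.set_runtime_mode(RuntimeExecutionMode.BATCH)

    source = (
        KafkaSource.builder()
        .set_bootstrap_servers(KAFKA_BROKERS)
        .set_topics(PAGE_CHANGE_TOPIC)
        .set_group_id("backfill-spike")  # hypothetical group id
        # Start bound: first offset at or after start_ms.
        .set_starting_offsets(KafkaOffsetsInitializer.timestamp(start_ms))
        # End bound: stop reading at the offset matching end_ms.
        .set_bounded(KafkaOffsetsInitializer.timestamp(end_ms))
        .set_value_only_deserializer(SimpleStringSchema())
        .build()
    )

    events = env.from_source(source, WatermarkStrategy.no_watermarks(), "page-change-source")
    # Map every event to 1 and sum under a single key to get one total.
    (
        events.map(lambda _: 1, output_type=Types.LONG())
        .key_by(lambda _: 0)
        .reduce(lambda a, b: a + b)
        .print()
    )
    env.execute("bounded-page-change-count")
```

To try it, one would compute the bounds with `bounded_window_ms(datetime.now(timezone.utc))` and pass them to `run_bounded_count`. In batch execution mode the keyed reduce should emit only the final total rather than a running count, which is what makes the same pipeline shape usable for both the streaming and the backfill case.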

Related Objects

Mentioned In: T340861: Implement a backfill job for the dumps hourly table
T335860: Implement job to transform mediawiki.page_content_change
T338231: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys
T338169: mw-page-content-change-enrich should partition by and process by wiki_id,page_id

Event Timeline

• lbowmaker created this task. Nov 30 2022, 2:22 PM

Restricted Application added a project: Data-Engineering. Nov 30 2022, 2:22 PM

Restricted Application added a subscriber: Aklapper.

Ottomata renamed this task from [SPIKE] Use Flink to develop bounded service to [SPIKE] Use Flink for batch backfilling. Nov 30 2022, 2:59 PM

Ottomata updated the task description.

• EChetty edited projects, added Data-Engineering-Planning; removed Data-Engineering. Dec 1 2022, 2:12 PM

• EChetty moved this task from Backlog to Event Platform on the Data-Engineering-Planning board. Jan 6 2023, 12:45 PM

• lbowmaker moved this task from Backlog to To be Estimated/To be discussed on the Event-Platform board. Jan 27 2023, 3:25 PM

JArguello-WMF moved this task from To be Estimated/To be discussed to Backlog on the Event-Platform board. Feb 9 2023, 1:12 PM

Ottomata mentioned this in T338169: mw-page-content-change-enrich should partition by and process by wiki_id,page_id. Jun 5 2023, 4:57 PM

Ottomata mentioned this in T338231: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys. Jun 6 2023, 1:26 PM

Ottomata mentioned this in T335860: Implement job to transform mediawiki.page_content_change. Jun 29 2023, 8:48 PM

JArguello-WMF removed a project: Data-Engineering-Planning. Jun 29 2023, 9:48 PM

Restricted Application added a project: Data-Engineering. Jun 29 2023, 9:48 PM

JArguello-WMF moved this task from Incoming (new tickets) to Event Platform Backlog on the Data-Engineering board. Jun 29 2023, 10:21 PM

xcollazo mentioned this in T340861: Implement a backfill job for the dumps hourly table. Jun 30 2023, 4:03 PM

xcollazo subscribed. Jun 30 2023, 4:18 PM

JArguello-WMF added a project: Data Engineering and Event Platform Team. Jun 30 2023, 4:29 PM

JArguello-WMF moved this task from Data Eng Backlog to Event Platform Backlog on the Data Engineering and Event Platform Team board. Jun 30 2023, 4:38 PM

gmodena moved this task from Event Platform Backlog to Event Platform Maintenance (current quarter) on the Data Engineering and Event Platform Team board. Oct 23 2023, 1:26 PM

• lbowmaker removed a project: Data Engineering and Event Platform Team. Nov 10 2023, 1:40 PM

Ottomata moved this task from Backlog to Stream Processing on the Event-Platform board. Oct 25 2024, 1:31 PM