Data Lake incremental Data Updates
Closed, Declined · Public

Assigned To

None

Authored By

	• Nuria
	Jul 21 2020, 5:28 PM

Description

Parent task for our efforts on incremental updates of event-based datasets.

From @JAllemandou's doc (https://docs.google.com/document/d/1vahntOx4zZU5Z7Dz5JE03Co5cEcw4kAuHtb-M8-cTRE)

The classical approach to loading data from its original sources into a data lake is to reload the entire dataset at regular intervals, in what can be called ‘full snapshots’. The interval between reloads depends on the need for fresh data (consumer-side view) and the time it takes to fully reload the data (production-side view). The full-reload approach has been widely used instead of an incremental approach mainly because of its simplicity and its easy integration with non-updatable data stores (HDFS, a widely used data lake storage layer, is append-only, for instance).
Recent years have seen changes in both consumer-side needs for data freshness (more up-to-date data means more accurate features for machine-learning algorithms, for instance) and data architecture (a move from representing state in a database to events representing business actions or database changes). New architecture patterns have emerged, such as Change Data Capture (CDC), Command Query Responsibility Segregation (CQRS), and Event Sourcing, usually grouped under what can be called Event-Driven Architecture.
With the rise of events and improvements in data lake technologies, the need for incremental data sources providing ACID-like capabilities, data mutation, schema evolution, and time travel has become more and more prevalent, and solutions have been proposed to address it.
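
To make the contrast between the two approaches concrete, here is a minimal sketch in plain Python. It is only an illustration: `ChangeEvent`, `full_reload`, and `apply_incremental` are hypothetical names that do not come from the linked doc or from any of the systems mentioned in this task, and the in-memory dict stands in for a real table.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change event: an upsert or delete of one row, keyed by id."""
    key: str
    op: str  # "upsert" or "delete"
    value: Optional[dict] = None


def full_reload(source_rows: Iterable[dict]) -> Dict[str, dict]:
    """Full-snapshot approach: rebuild the whole table from the source."""
    return {row["id"]: row for row in source_rows}


def apply_incremental(
    snapshot: Dict[str, dict], events: Iterable[ChangeEvent]
) -> Dict[str, dict]:
    """Incremental approach: apply ordered change events to the previous snapshot."""
    updated = dict(snapshot)  # leave the previous snapshot untouched
    for event in events:
        if event.op == "upsert":
            updated[event.key] = event.value
        elif event.op == "delete":
            updated.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return updated


# Assumed example data, for illustration only.
previous = full_reload([
    {"id": "p1", "title": "A"},
    {"id": "p2", "title": "B"},
])
events = [
    ChangeEvent("p2", "upsert", {"id": "p2", "title": "B2"}),
    ChangeEvent("p1", "delete"),
    ChangeEvent("p3", "upsert", {"id": "p3", "title": "C"}),
]
current = apply_incremental(previous, events)
```

The incremental path only touches the rows named in the events, which is why it needs a store that supports updates and deletes; keeping `previous` unchanged is a small nod to the time-travel requirement mentioned above.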

Related Objects

Status	Assigned	Task
Declined	None	T258511 Data Lake incremental Data Updates
Open	None	T231938 Get "edits hourly" on a daily basis
Resolved	Milimetric	T258532 [SPIKE] Prototype of incremental updates for mediawiki history for simplewiki , including reverts using apache hudi
Declined	None	T262205 Need for new event-type - `user_create` and `user_rename`
Resolved	JAllemandou	T262256 Test hudi and Iceberg as an incremental update system using 2 mediawiki-history snapshots
Declined	None	T262260 Make hudi work with Hive
Resolved	JAllemandou	T262261 Check whether mediawiki production event data is equivalent to mediawiki-history data over a month
Resolved	Milimetric	T215001 Revisions missing from mediawiki_revision_create
Resolved	None	T280538 Capture rev_is_revert event data in a stream different than mediawiki.revision-create
Declined	Milimetric	T263055 Add log entry details to page and user events in EventBus
Open	None	T240387 MW REST API Historical Data Endpoint Needs
Resolved	Milimetric	T241184 Design Document that proposes an alternative architecture for historic data endpoints
Declined	None	T241185 Flink Spike

Event Timeline

• Nuria created this task. Jul 21 2020, 5:28 PM

Restricted Application added a subscriber: Aklapper. Jul 21 2020, 5:28 PM

• Nuria renamed this task from Data ake incremental Data sources to Data Lake incremental Data Updates. Jul 21 2020, 7:52 PM

• Nuria updated the task description.

• Nuria added a subscriber: JAllemandou.

• Nuria added a subtask: T231938: Get "edits hourly" on a daily basis. Jul 21 2020, 7:55 PM

• Nuria added a subtask: T240387: MW REST API Historical Data Endpoint Needs. Jul 21 2020, 8:37 PM

JAllemandou updated the task description. Jul 22 2020, 9:59 AM

• fdans added a project: Analytics-Kanban. Aug 3 2020, 4:14 PM

• fdans moved this task from Next Up to Parent Tasks on the Analytics-Kanban board.

• fdans moved this task from Incoming to Datasets on the Analytics board.

• fdans removed a project: Analytics.

kzimmerman mentioned this in T231938: Get "edits hourly" on a daily basis. Aug 24 2020, 10:17 PM

kzimmerman added a project: Product-Analytics.

kzimmerman moved this task from Triage to Tracking on the Product-Analytics board. Aug 25 2020, 5:12 PM

• Mholloway subscribed. Apr 29 2021, 3:35 PM

Ottomata moved this task from Parent Tasks to Done on the Analytics-Kanban board. Oct 26 2021, 4:04 PM

Ottomata moved this task from Done to Parent Tasks on the Analytics-Kanban board.

odimitrijevic edited projects, added Epic, Analytics; removed Analytics-Kanban. Oct 27 2021, 10:28 PM

This is an important long-term goal. I don't think that having a high-level aspirational task in Phabricator will help us prioritize it. I'll add it to the list of large ticket items that we may wish to tackle in the next fiscal year.

Ottomata mentioned this in T314389: [SPIKE] Decide on technical solution for page state stream backfill process. Aug 18 2022, 2:18 PM

"I don't think that having a high level aspirational task in phabricator will help us prioritize it"

Hi @odimitrijevic! :)

Unless there are other discoverable public references to high-level aspirational tasks, others won't have a way of understanding our intentions to improve our systems. I just looked for this task but couldn't find it, and only found it because Dan knew to link it to me. The description here and the linked Google doc are relevant for work being done in the Event Platform Value Streams (T314389).

The point of the task is not necessarily to prioritize, but to document. If there is documentation elsewhere then we don't really need the task, but I don't think there is. Could we leave this (and others like it) open?
