
Data Lake incremental Data Updates
Closed, Declined, Public

Description

Parent task for our efforts on incremental updates of event-based datasets.

From @JAllemandou's doc (https://docs.google.com/document/d/1vahntOx4zZU5Z7Dz5JE03Co5cEcw4kAuHtb-M8-cTRE)

The historical approach to loading data from its original sources into a data lake is to reload the entire dataset as a ‘full snapshot’ at regular intervals. The interval between reloads depends on the need for fresh data (the consumer-side view) and on the time it takes to fully reload the data (the production-side view). The full-reload approach has been widely used instead of an incremental one mainly because of its simplicity and its easy integration with non-updatable data stores (HDFS, a widely used data lake storage layer, is append-only, for instance).
Recent years have seen changes both in consumer-side needs for data freshness (more up-to-date data means more accurate features for machine-learning algorithms, for instance) and in data architecture (a move from representing state in a database to events representing business actions or database changes). New architecture patterns have emerged, such as Change Data Capture (CDC), Command Query Responsibility Segregation (CQRS), Event Sourcing, and more, usually as part of what can be called Event-Driven Architecture.
With the rise of events and improvements in data lake technologies, the need for incremental data sources providing ACID-like capabilities, data mutation, schema evolution, and time travel has become more and more prevalent, and solutions have been proposed to address it.
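
To make the contrast between full snapshots and incremental updates concrete, here is a minimal Python sketch. It is not taken from the linked doc: the function names, the event shape (`op`, `key`, `row`), and the in-memory dictionaries standing in for a source database and a data lake table are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical types: a table is a mapping from primary key to row.
Row = Dict[str, Any]
Table = Dict[int, Row]


def full_snapshot_reload(source: Table) -> Table:
    """Rebuild the lake copy from scratch by copying every source row."""
    return {key: dict(row) for key, row in source.items()}


def apply_change_events(lake: Table, events: List[Dict[str, Any]]) -> Table:
    """Apply ordered change events to the lake copy, touching only changed rows."""
    updated = dict(lake)
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            updated.pop(key, None)
        elif event["op"] in ("insert", "update"):
            updated[key] = dict(event["row"])
        else:
            raise ValueError(f"unknown operation: {event['op']}")
    return updated


# Source state at time t0, loaded once as a full snapshot.
source_t0 = {1: {"title": "A"}, 2: {"title": "B"}}
lake = full_snapshot_reload(source_t0)

# Changes between t0 and t1, expressed as events instead of a new full copy.
events = [
    {"op": "update", "key": 2, "row": {"title": "B2"}},
    {"op": "insert", "key": 3, "row": {"title": "C"}},
    {"op": "delete", "key": 1},
]
lake = apply_change_events(lake, events)
print(lake)
```

The full reload costs time proportional to the size of the source, while the incremental path costs time proportional to the number of changes. The sketch mutates rows in place, which is the capability an append-only store such as HDFS does not offer by itself.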

Event Timeline

Nuria renamed this task from Data ake incremental Data sources to Data Lake incremental Data Updates. Jul 21 2020, 7:52 PM
Nuria updated the task description.
Nuria added a subscriber: JAllemandou.
fdans moved this task from Next Up to Parent Tasks on the Analytics-Kanban board.
fdans moved this task from Incoming to Datasets on the Analytics board.
fdans removed a project: Analytics.
odimitrijevic subscribed.

This is an important long-term goal. I don't think that having a high level aspirational task in phabricator will help us prioritize it. I'll add it to the list of large-ticket items that we may wish to tackle in the next fiscal year.

"I don't think that having a high level aspirational task in phabricator will help us prioritize it"

Hi @odimitrijevic! :)

Unless there are other discoverable public references to high-level aspirational tasks, others won't have a way of understanding our intentions to improve our systems. I just looked for this task and couldn't find it; I only found it because Dan knew to link it to me. The description here, and the one in the linked Google doc, are relevant to work being done in the Event Platform Value Streams (T314389).

The point of the task is not necessarily to prioritize; it is also to document. If there is documentation elsewhere, then we don't really need the task, but I don't think there is. Could we leave this (and others like it) open?