Data Lake incremental Data Updates
Open, Needs Triage · Public

Description

Parent task for our efforts on incremental updates of event-based datasets.

From @JAllemandou's doc (https://docs.google.com/document/d/1vahntOx4zZU5Z7Dz5JE03Co5cEcw4kAuHtb-M8-cTRE)

The historical, classical approach to loading data from its original sources into a data lake is to reload the entire dataset at regular intervals, in what can be called ‘full snapshots’. The interval between reloads depends on the need for fresh data (consumer-side view) and on the time it takes to fully reload the data (production-side view). The main reasons the full-reload approach has been more widely used than an incremental approach are its simplicity and its easy integration with non-updatable data stores (HDFS, a widely used data lake storage layer, is append-only, for instance).
Recent years have seen changes both in consumer-side needs for data freshness (more up-to-date data means more accurate features for machine-learning algorithms, for instance) and in data architecture (a move from representing state in a database to events representing business actions or database changes). New architecture patterns have emerged, such as Change Data Capture (CDC), Command Query Responsibility Segregation (CQRS), Event Sourcing and more, usually grouped under what can be called Event-Driven Architecture.
With the rise of events and improvements in data lake technologies, the need for incremental data sources providing ACID-like capabilities, data mutation, schema evolution, and time travel has become more and more prevalent, and solutions have been proposed to address it.
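
To make the contrast between the two loading strategies concrete, here is a minimal sketch in Python. It is not taken from the linked doc: the `ChangeEvent` type, the `full_snapshot` and `apply_events` helpers, and the in-memory dict standing in for a lake table are all hypothetical names chosen for illustration. It only shows that replaying row-level change events on top of an earlier snapshot can reach the same state as a full reload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical CDC-style event describing one row-level change."""
    op: str                      # "upsert" or "delete"
    key: int                     # primary key of the affected row
    value: Optional[str] = None  # new row value (unused for deletes)


def full_snapshot(source: dict[int, str]) -> dict[int, str]:
    """Full-reload approach: copy the entire source state."""
    return dict(source)


def apply_events(table: dict[int, str], events: list[ChangeEvent]) -> dict[int, str]:
    """Incremental approach: apply only the changes since the last load."""
    updated = dict(table)
    for event in events:
        if event.op == "upsert":
            updated[event.key] = event.value
        elif event.op == "delete":
            updated.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return updated


# Day 1: initial full load of the (illustrative) source table.
source_day1 = {1: "a", 2: "b", 3: "c"}
lake = full_snapshot(source_day1)

# Changes captured since the day-1 load.
events = [
    ChangeEvent("upsert", 2, "B"),
    ChangeEvent("delete", 3),
    ChangeEvent("upsert", 4, "d"),
]
lake = apply_events(lake, events)

# Day 2: the incrementally updated table matches a fresh full reload.
source_day2 = {1: "a", 2: "B", 4: "d"}
assert lake == full_snapshot(source_day2)
```

The incremental path touches three rows instead of rewriting the whole table, which is the freshness and cost argument made above. On an append-only store such as HDFS, the in-place mutation shown here is exactly the part that needs extra support from the table format.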

Event Timeline

Nuria created this task. · Jul 21 2020, 5:28 PM
Restricted Application added a subscriber: Aklapper. · Jul 21 2020, 5:28 PM
Nuria renamed this task from Data ake incremental Data sources to Data Lake incremental Data Updates. · Jul 21 2020, 7:52 PM
Nuria updated the task description.
Nuria updated the task description.
Nuria updated the task description.
Nuria added a subscriber: JAllemandou.
JAllemandou updated the task description. · Jul 22 2020, 9:59 AM
fdans moved this task from Next Up to Parent Tasks on the Analytics-Kanban board.
fdans moved this task from Incoming to Datasets on the Analytics board.
fdans removed a project: Analytics.
kzimmerman moved this task from Triage to Tracking on the Product-Analytics board. · Aug 25 2020, 5:12 PM