Parent task for our work on incremental updates of event-based datasets
The classical approach to loading data from its original sources into a data lake is to reload the entire dataset at regular intervals, producing what can be called "full snapshots". The interval between reloads depends on how fresh the data needs to be (the consumer-side view) and on how long a full reload takes (the production-side view). The full-reload approach has been widely used instead of an incremental one mainly because it is simple and integrates easily with data stores that do not support in-place updates (HDFS, a widely used data lake storage layer, is append-only, for instance).
Recent years have seen changes both in consumer-side needs for data freshness (more up-to-date data means more accurate features for machine-learning algorithms, for instance) and in data architecture (a move from representing state in a database to events representing business actions or database changes). New architecture patterns have emerged, such as Change Data Capture (CDC), Command Query Responsibility Segregation (CQRS), Event Sourcing and more, usually grouped under the name Event-Driven Architecture.
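To make the contrast between a state representation and change events concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any existing pipeline: the event shape (`op`, `key`, `row`) and the function names `full_snapshot` and `apply_events` are hypothetical. It assumes events arrive in order and that every row has a primary key.

```python
from typing import Any

Row = dict[str, Any]
Table = dict[int, Row]  # primary key -> row (hypothetical in-memory stand-in for a dataset)


def full_snapshot(source: Table) -> Table:
    """Full reload: copy every row of the source, whether it changed or not."""
    return {key: dict(row) for key, row in source.items()}


def apply_events(table: Table, events: list[dict[str, Any]]) -> Table:
    """Incremental update: replay change events, in order, on the previous state."""
    result = {key: dict(row) for key, row in table.items()}
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            result.pop(key, None)
        else:
            # "insert" and "update" are both treated as an upsert of the given fields.
            result[key] = {**result.get(key, {}), **event["row"]}
    return result


# State at the previous load, and the changes that happened since then.
previous = {1: {"name": "a"}, 2: {"name": "b"}}
events = [
    {"op": "update", "key": 1, "row": {"name": "a2"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "c"}},
]
# What the source database looks like now.
current_source = {1: {"name": "a2"}, 3: {"name": "c"}}

incremental = apply_events(previous, events)
snapshot = full_snapshot(current_source)
```

Under these assumptions both paths end in the same state, but the incremental path only touches the three changed rows, while the full reload reads the whole source again.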
With the rise of event-based data and improvements in data lake technologies, the need for incrementally updated data sources that provide ACID-like guarantees, data mutation, schema evolution, and time travel has become more and more prevalent, and several solutions have been proposed to address it.