Change Details

Parent task for our efforts on incremental updates of event-based datasets.

From @JAllemandou's doc (https://docs.google.com/document/d/1vahntOx4zZU5Z7Dz5JE03Co5cEcw4kAuHtb-M8-cTRE):

The historical/classical approach to loading data from its original sources into data lakes is to reload the entire dataset at regular time intervals, in what can be called ‘full snapshots’. The interval between reloads depends on the need for fresh data (consumer-side view) and on the time it takes to fully reload the data (production-side view). The full-reload approach has been widely used instead of an incremental approach mainly because of its simplicity and its easy integration with non-updatable data stores (HDFS, a widely used data lake storage, is append-only, for instance).

Recent years have seen changes in both consumer-side needs for data freshness (more up-to-date data means more accurate features for machine-learning algorithms, for instance) and data architecture (a move from representing state in a database to events representing business actions or database changes). New architecture patterns have emerged, such as Change Data Capture (CDC), Command Query Responsibility Segregation (CQRS), Event Sourcing and more, usually within what can be called Event-Driven Architecture.

With the //rise of the events// and improvements in data lake technologies, the need for incremental data sources providing ACID-like capabilities, data mutation, schema evolution, and time travel has become more and more prevalent, and solutions have been proposed to address it.
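To make the contrast between the two loading strategies concrete, here is a minimal sketch in plain Python. It is not taken from the doc: `ChangeEvent`, `full_snapshot` and `apply_events` are hypothetical names, and in-memory dicts stand in for the source database and the data lake table. It only illustrates that replaying change events onto the previous state yields the same result as a full reload, while touching only the rows that changed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change event: one row-level change captured from the source."""
    op: str                      # "upsert" or "delete"
    key: int
    value: Optional[str] = None  # unused for deletes


def full_snapshot(source: dict[int, str]) -> dict[int, str]:
    """Full reload: copy every row of the source, whether it changed or not."""
    return dict(source)


def apply_events(table: dict[int, str], events: list[ChangeEvent]) -> dict[int, str]:
    """Incremental update: replay only the change events onto the previous state."""
    updated = dict(table)
    for event in events:
        if event.op == "upsert":
            updated[event.key] = event.value
        elif event.op == "delete":
            updated.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return updated


# Source state at the first load, and the copy held in the lake.
source_t0 = {1: "a", 2: "b", 3: "c"}
lake = full_snapshot(source_t0)

# Changes made to the source between the two loads, and the resulting state.
events = [
    ChangeEvent("upsert", 2, "B"),
    ChangeEvent("delete", 3),
    ChangeEvent("upsert", 4, "d"),
]
source_t1 = {1: "a", 2: "B", 4: "d"}

incremental = apply_events(lake, events)  # processes 3 events
reloaded = full_snapshot(source_t1)       # copies every row again
```

In this toy case both strategies end in the same state. The incremental path needs a store that supports updates and deletes, which is why append-only storage has historically favoured full reloads.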