The WMF data infrastructure generates monthly snapshots for content-related datasets; each snapshot contains the full history of the data.
The knowledge gaps pipeline depends on these snapshot data sources and recomputes content gap metrics for all time at every run (i.e. "the past can change"). However, for some (most?) use cases of the content gap metrics data, we are only interested in the new month of data and generally do not want to update older data.
Investigate and implement a way to generate an incremental dataset:
- either maintain a second version of the snapshot-based datasets, which appends only the new month of data after each pipeline run
- or run the pipeline only for the new month of data, i.e. do not even compute the historical metrics (which might have changed given the new dump)
- alternative: use the event architecture / streaming, which would eliminate the need for snapshot-based data
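The first option above could be sketched roughly as follows: after a full-history pipeline run, select only the rows for months after the last published month and append those to the incremental dataset. This is a minimal illustration in plain Python; the function name, the dict-based row schema, and the `month` key are all hypothetical stand-ins for whatever the actual pipeline (e.g. a Spark job partitioned by month) would use.

```python
from datetime import date

def select_incremental_rows(rows, last_published_month):
    """Keep only rows for months strictly after the last published month.

    `rows` is an iterable of dicts with a "month" key holding the first day
    of the month as a datetime.date. Hypothetical schema for illustration.
    """
    return [r for r in rows if r["month"] > last_published_month]

# Example: a full-history output of one pipeline run (fabricated values).
full_history = [
    {"month": date(2023, 1, 1), "gap": "gender", "value": 0.12},
    {"month": date(2023, 2, 1), "gap": "gender", "value": 0.13},
    {"month": date(2023, 3, 1), "gap": "gender", "value": 0.14},
]

# If the incremental dataset was last published through February 2023,
# only the March rows would be appended on this run.
new_rows = select_incremental_rows(full_history, date(2023, 2, 1))
```

Note that this append-only variant silently drops any revisions to historical months that the new dump introduced; that trade-off is exactly what the second option (computing only the new month) makes explicit, and what the streaming alternative would avoid altogether.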