
Incremental knowledge gap dataset
Open, Needs Triage · Public

Description

The WMF data infrastructure generates monthly snapshots for content-related datasets; every snapshot contains the full history of the data.
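
For illustration, a minimal PySpark sketch of how such a snapshot source is typically read, assuming a table partitioned by a `snapshot` column (the table name here is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each snapshot partition holds the *full* history, so reading the
# 2024-01 snapshot returns all rows ever produced, not just January.
snapshot = "2024-01"
history = (
    spark.read.table("wmf.mediawiki_history")  # illustrative source table
    .where(f"snapshot = '{snapshot}'")
)
```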

The knowledge gaps pipeline depends on these snapshot data sources and produces content gap metrics for all time at every run (i.e. "the past can change"). However, for some (most?) use cases of the content gap metrics data, we are only interested in the new month of data and generally do not want to update older data.

Investigate and implement a way to generate an incremental dataset:

  • either have a second version of the snapshot-based datasets, which appends only the new month of data after each pipeline run (see the sketch after this list)
  • run the pipeline only for the new month of data, i.e. do not even compute the historical metrics (which might have changed given the new dump)
  • alternative: use the event architecture / streaming, which would eliminate the need for snapshot-based data
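
A minimal sketch of the first option, assuming a hypothetical metrics function and illustrative table and column names (none of these are the actual pipeline schema): after recomputing metrics from the new snapshot, keep only the newest month and append it to an incremental table, leaving previously published months untouched.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

new_snapshot = "2024-01"  # snapshot that just landed

# Full recompute, as the pipeline does today
# (compute_content_gap_metrics is a hypothetical stand-in).
metrics = compute_content_gap_metrics(spark, snapshot=new_snapshot)

# Keep only the month that is new relative to the previous run,
# so already-published months are never rewritten.
incremental = metrics.where(F.col("time_bucket") == new_snapshot)

# Append the new month as a partition of the incremental table.
(
    incremental.write
    .mode("append")
    .partitionBy("time_bucket")
    .saveAsTable("knowledge_gaps.content_gap_metrics_incremental")
)
```

The trade-off of this option is that it double-writes (full table plus incremental table) but keeps downstream consumers stable, since rows for a published month never change after they land.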

Details

Other Assignee
fkaelin

Event Timeline

fkaelin added a subscriber: XiaoXiao-WMF.

This task requires design/implementation. Given that the current implementation is stable, I am moving this task to the freezer until there is a more urgent need for an incremental dataset.

To note: The processing delay is a recurring ask/complaint. It takes ~12 days for all dependent pipelines and the knowledge gaps pipeline itself to complete; most of this delay is due to the import of the dumps into the data engineering infrastructure (~10 days). Using incremental mediawiki datasets, this could be shortened to hours (or to real time for a streaming pipeline), but this depends on T120242 being completed.

cc @XiaoXiao-WMF