
Incremental knowledge gap dataset
Open, Needs Triage · Public

Description

The WMF data infrastructure generates monthly snapshots for content-related datasets; every snapshot contains the full history of the data.
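
For illustration, a minimal PySpark sketch of how such a snapshot source is typically read, assuming a table partitioned by a `snapshot` column (the table name here is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each snapshot partition holds the *full* history, so reading the
# 2024-01 snapshot returns all rows ever produced, not just January.
snapshot = "2024-01"
history = (
    spark.read.table("wmf.mediawiki_history")  # illustrative source table
    .where(f"snapshot = '{snapshot}'")
)
```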

The knowledge gaps pipeline depends on these snapshot data sources and produces content gap metrics for all time at every run (i.e. "the past can change"). However, for some (most?) use cases of the content gap metrics data, we are only interested in the new month of data and generally do not want to update older data.

Investigate and implement a way to generate an incremental dataset:

  • either have a second version of the snapshot-based datasets, which appends only the new month of data after each pipeline run (see the sketch after this list)
  • run the pipeline only for the new month of data, i.e. do not even compute the historical metrics (which might have changed given the new dump)
  • alternative: use the event architecture / streaming, which would eliminate the need for snapshot-based data
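
A minimal sketch of the first option, assuming a hypothetical metrics function and illustrative table and column names (none of these are the actual pipeline schema): after recomputing metrics from the new snapshot, keep only the newest month and append it to an incremental table, leaving previously published months untouched.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

new_snapshot = "2024-01"  # snapshot that just landed

# Full recompute, as the pipeline does today
# (compute_content_gap_metrics is a hypothetical stand-in).
metrics = compute_content_gap_metrics(spark, snapshot=new_snapshot)

# Keep only the month that is new relative to the previous run,
# so already-published months are never rewritten.
incremental = metrics.where(F.col("time_bucket") == new_snapshot)

# Append the new month as a partition of the incremental table.
(
    incremental.write
    .mode("append")
    .partitionBy("time_bucket")
    .saveAsTable("knowledge_gaps.content_gap_metrics_incremental")
)
```

The trade-off of this option is that it double-writes (full table plus incremental table) but keeps downstream consumers stable, since rows for a published month never change after they land.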

Details

Other Assignee
fkaelin

Event Timeline

fkaelin added a subscriber: XiaoXiao-WMF.

This task requires design/implementation. Given that the current implementation is stable, I am moving this task to the freezer until there is a more urgent need for an incremental dataset.

To note: The processing delay is a recurring ask/complaint. It takes ~12 days for all dependent pipelines and the knowledge gaps pipeline itself to complete; most of this delay is due to the import of the dumps into the data engineering infrastructure (~10 days). Using incremental mediawiki datasets, this could be shortened to hours (or to real time for a streaming pipeline), but this depends on T120242 being completed.

cc @XiaoXiao-WMF