The content gap metrics data generated by the knowledge gaps pipeline was affected by two separate issues:
- a problem with the wikitext history source caused the 2023-05 run of the knowledge gap pipeline to ingest duplicate data (T342911)
- the knowledge gap pipeline itself was misconfigured and used an outdated input dataset
Background
This investigation started when @Mayakp.wiki noticed that the standard quality metrics for the "wmf region" content gap seemed wrong.
An article meets the standard quality threshold based on a set of heuristics, such as its number of images, links, and references. These features are derived from the wikitext of each revision in the history. The standard quality dataset, along with others such as pageview and revision counts, is then used to compute the metrics for each content gap.
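To make the heuristics concrete, here is a minimal sketch of such a feature extraction step using mwparserfromhell. The feature set and the thresholds are hypothetical stand-ins for illustration; the pipeline's actual heuristics and cutoffs live in the knowledge gaps codebase.

```python
import mwparserfromhell


def extract_features(wikitext: str) -> dict:
    """Derive simple quality features from one revision's wikitext."""
    code = mwparserfromhell.parse(wikitext)
    links = code.filter_wikilinks()
    return {
        # Count <ref> tags as a proxy for references.
        "num_references": sum(
            1 for tag in code.filter_tags() if str(tag.tag).lower() == "ref"
        ),
        "num_links": len(links),
        # Wikilinks to the File:/Image: namespaces embed media.
        "num_images": sum(
            1
            for link in links
            if str(link.title).strip().lower().startswith(("file:", "image:"))
        ),
    }


def meets_standard_quality(features: dict) -> bool:
    # Hypothetical thresholds, not the pipeline's real configuration.
    return (
        features["num_references"] >= 5
        and features["num_links"] >= 10
        and features["num_images"] >= 1
    )
```

In the real pipeline this computation runs per revision over the full wikitext history, which is what makes the input volume (and any duplication of it) so consequential.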
The data engineering pipeline that generates the wikitext produced four copies of the data for May 2023, which caused the article feature pipeline to fail (likely due to the resulting 44 TB input size). When that month was rerun manually, the pipeline ingested the duplicated data, which in turn led to incorrect metrics. After the failure was debugged and the secondary misconfiguration was discovered, the pipeline was restarted and is currently re-computing up-to-date content gap metrics.
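A useful sanity check for the duplication issue is to count revisions that appear more than once in the ingested snapshot. The Spark sketch below assumes the wikitext source is the wmf.mediawiki_wikitext_history table partitioned by snapshot; the table and column names are assumptions for illustration, not taken from the pipeline code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed source table and partition layout for the wikitext history.
wikitext = spark.table("wmf.mediawiki_wikitext_history").where(
    F.col("snapshot") == "2023-05"
)

# A (wiki_db, revision_id) pair should be unique; any count > 1 indicates
# the snapshot was ingested with duplicate copies.
duplicates = (
    wikitext.groupBy("wiki_db", "revision_id")
    .count()
    .where(F.col("count") > 1)
)
print(f"duplicated revisions: {duplicates.count()}")
```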