Content gap metrics data issue
Closed, Resolved · Public

Description

The data generated by the pipeline for content gap metrics of the knowledge gaps project experienced two separate issues:

  • a problem with the wikitext history source caused the 2023-05 run of the knowledge gap pipeline to ingest duplicate data. T342911
  • the knowledge gap pipeline itself was misconfigured and used an outdated input dataset

Background

This investigation started when @Mayakp.wiki noticed that the standard quality metrics for the "wmf region" content gap seemed wrong.

An article meets the standard quality threshold based on a set of heuristics, like the number of images/links/references/etc. These features are derived from the wikitext of each revision in history. The standard quality dataset, along with others like the number of pageviews/revisions/etc., is used to compute the metrics for each content gap.
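
A minimal sketch of such a threshold check, for illustration only; the feature names and cutoff values below are made up and are not the pipeline's actual heuristics:

```python
# Sketch of a standard-quality check. The features and thresholds
# here are illustrative, not the pipeline's real heuristics.
from dataclasses import dataclass

@dataclass
class ArticleFeatures:
    num_images: int
    num_links: int
    num_references: int

def meets_standard_quality(f: ArticleFeatures) -> bool:
    # Illustrative cutoffs only.
    return f.num_images >= 1 and f.num_links >= 10 and f.num_references >= 5

print(meets_standard_quality(ArticleFeatures(num_images=2, num_links=25, num_references=8)))  # True
```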

The data engineering pipeline generating the wikitext produced 4 copies of the data for May 2023, which caused the article feature pipeline to fail (likely due to the size of 44TB). When this month was rerun manually, the pipeline ingested the duplicated data, which in turn led to incorrect metrics. After debugging and discovering the secondary issue, the pipeline is currently re-computing up-to-date content gap metrics.
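
For reference, a minimal PySpark sketch of the kind of duplicate check used when debugging an input like this; the table name, snapshot value, and key columns are assumptions for illustration:

```python
# Sketch: look for duplicated revision rows in the wikitext history input.
# Table, snapshot, and column names are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dupes = (
    spark.table("wmf.mediawiki_wikitext_history")
    .where(F.col("snapshot") == "2023-05")
    .groupBy("wiki_db", "revision_id")
    .count()
    .where(F.col("count") > 1)
)
# A non-zero result means the source contains duplicated revision rows.
print(dupes.count())
```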

Details

Due Date
Aug 8 2023, 4:00 AM

Event Timeline

fkaelin triaged this task as High priority.Jul 29 2023, 4:42 AM
fkaelin created this task.

The data was generated successfully, and it looks right. The Hive table knowledge_gaps contains up-to-date data.

The CSV files are not available on the analytics web downloads yet; they should show up eventually. The CSV files are stored on HDFS first, e.g. /wmf/data/published/datasets/knowledge_gaps/content_gaps/knowledge_gap_index_metrics_csv/content_gap=geography_wmf_region, so you can download them from there if needed. If they are not present on the web download next week, we should ask data engineering why these files are not synced.
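
If you need the files before the sync happens, a minimal sketch of pulling them from HDFS, assuming a host with an HDFS client configured:

```python
# Sketch: copy the published CSVs from HDFS while the web sync is pending.
# Assumes the host has an HDFS client configured.
import subprocess

src = ("/wmf/data/published/datasets/knowledge_gaps/content_gaps/"
       "knowledge_gap_index_metrics_csv/content_gap=geography_wmf_region")
subprocess.run(["hdfs", "dfs", "-get", src, "."], check=True)
```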

Leaving this open pending further verification.

@Mayakp.wiki Following up regarding the change in metrics for a given month when they are generated in subsequent monthly pipeline runs, e.g. a metric value for month 2023-05 calculated in June can differ from the same metric value calculated in July.

As Wikidata is updated, the percentage of articles associated with a content gap increases (or decreases, in the case of corrections).
Example: an existing article should be associated with a certain country, but Wikidata is missing this relation. If that is fixed in a given month, the article is included in the aggregation of the metrics timeseries, so the numbers for prior months can change as that article's metrics now count toward e.g. the country geography gap.
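
A toy illustration of this effect, with made-up data:

```python
# Toy illustration: after a missing country association is fixed in
# Wikidata, article A's whole history counts toward the gap, so the
# aggregate for the earlier month 2023-04 changes retroactively.
import pandas as pd

revisions = pd.DataFrame({
    "article": ["A", "A", "B"],
    "month":   ["2023-04", "2023-05", "2023-04"],
    "edits":   [3, 2, 5],
})

for linked in ({"B"}, {"A", "B"}):  # before vs. after the Wikidata fix
    per_month = revisions[revisions["article"].isin(linked)].groupby("month")["edits"].sum()
    print(per_month.to_dict())
# {'2023-04': 5}
# {'2023-04': 8, '2023-05': 2}  <- 2023-04 changed in the later run
```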

When integrating content gap metrics into Wikistats, the decision was made not to modify the past, i.e. only the new metrics of the current month are appended. For Movement Insight's use case of this data, this seems like a good approach for top-level metrics / trends too. In effect this means that, given a new monthly content gap metrics dataset, you would only need the metrics for that given month (e.g. for the July run, you would copy only the 2023-06 rows to the spreadsheet).
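
A minimal sketch of that append-only update; the file names and the "time_bucket" column are hypothetical:

```python
# Sketch of the append-only update: keep only the newest month from a
# fresh run and append it, leaving prior months untouched. File and
# column names are hypothetical.
import pandas as pd

existing = pd.read_csv("metrics_timeseries.csv")
fresh = pd.read_csv("metrics_latest_run.csv")

# Newest month in the fresh run, e.g. "2023-06" for the July run.
latest_month = fresh["time_bucket"].max()
new_rows = fresh[fresh["time_bucket"] == latest_month]

pd.concat([existing, new_rows], ignore_index=True).to_csv(
    "metrics_timeseries.csv", index=False
)
```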

Do you think the fact that the "past" can change the numbers is of interest for any metrics? For example, for Wikidata initiatives that aim to improve coverage of underrepresented articles, it could be a way to represent progress (e.g. "our efforts led to an X% change in the # of edits", meaning edits that should have counted towards a given content gap but were previously missed).
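
If that turns out to be useful, a minimal sketch of measuring it by joining the same months across two runs; the file and column names are hypothetical:

```python
# Sketch: quantify how much the "past" moved between two pipeline runs
# by comparing the same months across runs. File and column names are
# hypothetical; months with zero edits would need guarding.
import pandas as pd

run_jun = pd.read_csv("metrics_run_2023-06.csv")
run_jul = pd.read_csv("metrics_run_2023-07.csv")

keys = ["content_gap", "category", "time_bucket"]
merged = run_jun.merge(run_jul, on=keys, suffixes=("_jun", "_jul"))
merged["delta_pct"] = (
    100.0 * (merged["edit_count_jul"] - merged["edit_count_jun"])
    / merged["edit_count_jun"]
)
print(merged.loc[merged["delta_pct"] != 0, keys + ["delta_pct"]])
```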

Closing this as resolved, as the data has been manually corrected and subsequent issues seem unrelated.