Issues in the dumps → mediawiki wikitext history → content gap metrics pipeline can significantly delay the movement metrics report
Open, Medium, Public

Description

This task is primarily intended for documenting how Movement-Metrics is affected by problems with Dumps-Generation and mediawiki_wikitext_history. For that reason, it is not tagged with Data-Platform or Dumps-Generation.

SDS 2.6.2 (FY2023-24) has been focused on improving the delivery of the movement metrics report. Our critical path is as follows:

Before T357859, the average duration was 26 days. Afterward, the average duration has been 17 days.

| Data interval | Days to availability of knowledge gaps | Notes |
| --- | --- | --- |
| 2023-09 | 23.1 | 1 day delay due to T342911 |
| 2023-10 | 25.5 | |
| 2023-11 | 27.9 | 4 day delay due to T342911 |
| 2023-12 | 26.1 | 2 day delay due to T342911 |
| 2024-01 | 26.9 | 1 day delay due to T342911, knowledge gaps job issue (T358613) |
| 2024-02 | 10.6 | First run skipping Wikidata to save time (T357859), 1 day delay due to T342911 |
| 2024-03 | 18.7 | Dumps generation issue, ultimately resolved by skipping Commons (T362454), 1 day delay due to T342911 |
| 2024-04 | 14.1 | Dumps generation issue (T364391) |
| 2024-05 | 23.8 | Major dumps generation issue (T365155) |
| 2024-06 | 18.0 | Wikidata JSON dumps issue (T370050) |
| 2024-07 | 18.0 | Wikidata JSON dumps were disabled on June 24th (T368098#9919385) and not re-enabled until July 10th (T368098#9969417) |
| 2024-08 | 26.0 | Delays in the mediawiki_wikitext_history Airflow job: the labswiki (wikitech) project was added to the sqoop list but is not generated as a dump (T217792#10172352) |

(raw data in spreadsheet)
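
For anyone wanting to re-check the averages quoted above against the table, here is a minimal sketch. The values are copied from the table; the helper name `average_days` and the assumption that 2024-02 is the first data interval affected by T357859 are illustrative, not taken from the linked spreadsheet.

```python
from statistics import mean

# Days to availability of knowledge gaps, keyed by data interval,
# copied from the table in this task description.
DAYS_TO_AVAILABILITY = {
    "2023-09": 23.1,
    "2023-10": 25.5,
    "2023-11": 27.9,
    "2023-12": 26.1,
    "2024-01": 26.9,
    "2024-02": 10.6,
    "2024-03": 18.7,
    "2024-04": 14.1,
    "2024-05": 23.8,
    "2024-06": 18.0,
    "2024-07": 18.0,
    "2024-08": 26.0,
}

# Assumption: the 2024-02 run is the first one that benefited from T357859.
FIRST_INTERVAL_AFTER_CHANGE = "2024-02"


def average_days(rows, start=None, end=None):
    """Mean days for intervals with start <= interval <= end (both inclusive).

    Intervals are "YYYY-MM" strings, so plain string comparison is chronological.
    """
    selected = [
        days
        for interval, days in rows.items()
        if (start is None or interval >= start) and (end is None or interval <= end)
    ]
    return mean(selected)


before = average_days(DAYS_TO_AVAILABILITY, end="2024-01")
after_all = average_days(DAYS_TO_AVAILABILITY, start=FIRST_INTERVAL_AFTER_CHANGE)
after_through_may = average_days(
    DAYS_TO_AVAILABILITY, start=FIRST_INTERVAL_AFTER_CHANGE, end="2024-05"
)

print(f"before T357859:              {before:.1f} days")
print(f"after, all rows listed:      {after_all:.1f} days")
print(f"after, 2024-02 to 2024-05:   {after_through_may:.1f} days")
```

With the rows as listed, the pre-change mean rounds to 26 days, matching the description. The post-change mean depends on which months are included: restricting to 2024-02 through 2024-05 gives roughly 17 days, which suggests (but does not confirm) that the 17-day figure was computed before the later rows were added.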

Event Timeline

nshahquinn-wmf renamed this task from "Dumps and mediawiki_wikitext_history issues can significantly delay the movement metrics report" to "Issues in the dumps → mediawiki wikitext history → content gap metrics pipeline can significantly delay the movement metrics report". (May 20 2024, 8:08 PM)
nshahquinn-wmf updated the task description.
OSefu-WMF triaged this task as Medium priority. (Jul 10 2024, 6:51 PM)
OSefu-WMF moved this task from Incoming to Waiting on others on the Movement-Insights board.

@Hghani and @Mayakp.wiki - Can you add to the table above any delays that you experienced with the delivery of the content gaps metrics in July and Aug metric runs? I want to keep a full history of issues as I work to address this with data eng/platform.

@OSefu-WMF: I updated the table and added the notes from our Slack threads on the knowledge gap delay.
FYI, I've not updated the spreadsheet.

Expect possible delays in January 2025 content gap metrics calculation, due to T368098.

mediawiki_wikitext_history doesn't exist anymore. Should we close this?