Duplicate articles in snapshot dump
Open, High, Public

Description

User Story: “As a client, I want unique articles in our snapshot dumps, so that I don't need to deal with de-duplication.”

Acceptance criteria

Write an integration test to check that there are no duplicate articles in our snapshot dump output files.

ToDo

  • Download snapshot dump files from our s3 bucket
  • Write code that looks for duplicate articles in our enterprise_html/runs/20231201/enwiktionary-NS0-20231201-ENTERPRISE-HTML.json.tar.gz
  • Run the duplicate test on a sample of our snapshot dumps
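
The duplicate check in the ToDo list could be sketched roughly as follows. This is only an illustration under assumptions, not the team's implementation: it assumes the file has already been downloaded locally, that each member of the tar archive holds one JSON object per line, and that the top-level `identifier` field is the page ID (as the jq command later in this task suggests). The function names `iter_articles` and `duplicate_page_ids` are hypothetical.

```python
import json
import tarfile
from collections import Counter


def iter_articles(archive_path):
    """Yield one dict per article from a local .json.tar.gz snapshot dump.

    Assumption: every regular file in the archive is newline-delimited JSON.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            # Iterate raw byte lines; json.loads accepts UTF-8 bytes.
            for raw_line in handle:
                if raw_line.strip():
                    yield json.loads(raw_line)


def duplicate_page_ids(archive_path):
    """Return {page_id: count} for every page ID that appears more than once."""
    counts = Counter(article["identifier"] for article in iter_articles(archive_path))
    return {page_id: n for page_id, n in counts.items() if n > 1}
```

An integration test could call `duplicate_page_ids(...)` on each sampled dump file and assert that the result is an empty dict. Counting only page IDs keeps memory use proportional to the number of articles rather than to the size of the HTML payloads.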

Checklist for testing

  • No duplicates in future snapshot dumps for namespace 0 or other namespaces
Things to consider:

Event Timeline

ROdonnell-WMF created this task.

Please test for duplicates by page_id as well as revision_id. I've found duplicate revisions in the dumps and also multiple revisions of a single page.
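
A minimal sketch of what checking both cases might look like, assuming each article is a dict with the page ID in `identifier` and the revision ID in `version.identifier` (the field names used in the jq command in this task). The function name `find_duplicates` and the page 42 record in the usage note are made up for illustration.

```python
from collections import defaultdict


def find_duplicates(articles):
    """Split duplicates into the two cases described above.

    Returns (repeated_revisions, multiple_revisions):
      repeated_revisions: page_id -> revision IDs that occur more than once
      multiple_revisions: page_id -> all distinct revision IDs, when there are several
    """
    revisions_by_page = defaultdict(list)
    for article in articles:
        revisions_by_page[article["identifier"]].append(article["version"]["identifier"])

    repeated_revisions = {}
    multiple_revisions = {}
    for page_id, revisions in revisions_by_page.items():
        distinct = set(revisions)
        if len(revisions) != len(distinct):
            # The same (page_id, revision_id) pair is present more than once.
            repeated_revisions[page_id] = sorted(
                r for r in distinct if revisions.count(r) > 1
            )
        if len(distinct) > 1:
            # One page is present at several different revisions.
            multiple_revisions[page_id] = sorted(distinct)
    return repeated_revisions, multiple_revisions
```

For example, feeding it two identical records for a made-up page 42 at revision 1000 would report page 42 under `repeated_revisions`, while a page included at three different revisions would appear under `multiple_revisions`.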

This may be related to T362894: Data quality: HTML dumps contain unexplainably outdated revisions of some pages. The duplicates seem to have various revision IDs. Here's a set showing that the article is included three times with the same title and page ID, but at different versions:

tar xzf dewiki-NS0-20240201-ENTERPRISE-HTML.json.tar.gz -O | jq 'select(.name == "10.000 B.C.") | .identifier,.version.identifier'
page id    revision id
3394140    234268857
3394140    241670834
3394140    241670882