User Story: “As a client, I want unique articles in our snapshot dumps, so that I don't need to deal with de-duplication.”
Acceptance criteria
Write an integration test to check that there are no duplicate articles in our snapshot dump output files
ToDo
- Download snapshot dump files from our s3 bucket
- Write code that looks for duplicate articles in our enterprise_html/runs/20231201/enwiktionary-NS0-20231201-ENTERPRISE-HTML.json.tar.gz
- Run the duplicate test on a sample of our snapshot dumps
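
The duplicate check in the steps above could be sketched roughly as follows. This is a minimal sketch, not the agreed implementation. It assumes the archive has already been downloaded locally, that it contains newline-delimited JSON files (one article object per line), and that each article carries an `identifier` field that can serve as the uniqueness key. The function name `find_duplicate_articles` and the `key` parameter are hypothetical, and the key should be changed if uniqueness is defined differently.

```python
import json
import tarfile
from collections import Counter


def find_duplicate_articles(archive_path, key="identifier"):
    """Return {key_value: count} for every article seen more than once.

    Assumption: each regular file in the archive is newline-delimited JSON,
    and each article object has the field named by `key`.
    """
    counts = Counter()
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            stream = archive.extractfile(member)
            for raw_line in stream:
                line = raw_line.strip()
                if not line:
                    continue  # tolerate blank lines
                article = json.loads(line)
                counts[article[key]] += 1
    # Keep only the keys that occurred more than once.
    return {value: n for value, n in counts.items() if n > 1}
```

In an integration test, the assertion could be that `find_duplicate_articles(path)` returns an empty dict for each sampled dump. The returned counts make a failure message easier to read, since they show which articles were repeated and how often.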
Checklist for testing
- No duplicates in future snapshot dumps for namespace 0 or other namespaces
Things to consider:
- Zendesk ticket: https://wikimediaenterprise.zendesk.com/agent/tickets/521
- Follow up with the client on Zendesk or by email to tell them the defect is resolved