
{Investigation} Different file sizes for dumps
Open, In Progress · High · Public · BUG REPORT

Description

It looks like we are missing data in our eowiki namespace 0 (NS0) dumps; we need to figure out the root cause. More information can be found here: https://meta.wikimedia.org/wiki/Talk:Wikimedia_Enterprise#Esperanto_(eowiki-NS0)_and_Aragonese_(anwiki-NS0)_Wikipedia_problem.
For context: our dumps are mirrored to https://dumps.wikimedia.org/ twice a month; they can be found at https://dumps.wikimedia.org/other/enterprise_html/runs/.

Acceptance criteria

* Figure out the root cause
* Create a ticket for the solution (if the root cause is identified)
* Communicate the findings back to the Talk page

Developer Notes

  • Same issue showing up in English Wiktionary (enwikt):

File sizes from the most recent enwikt HTML dumps (NS0):

20230701: 13 GB
20230720: 7.1 GB
20230801: 1.1 GB
20230820: 4.6 GB
20230901: 7.2 GB
20230920: 3 GB
20231001: 5 GB
20231020: 2.9 GB
20231101: 3.0 GB

Something's going badly wrong there.
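Those sizes are easy to pull programmatically, which makes the fluctuation straightforward to track across runs. Below is a minimal sketch (not part of the dump tooling) that issues a HEAD request per run and prints the reported Content-Length; it assumes the URL layout visible in the paths quoted later in this task and that the Wiktionary files use an `enwiktionary-NS0-...` prefix.

```python
#!/usr/bin/env python3
"""Compare Enterprise HTML dump sizes across runs without downloading them.

Assumes the URL layout seen in this task:
  https://dumps.wikimedia.org/other/enterprise_html/runs/<run>/<wiki>-NS0-<run>-ENTERPRISE-HTML.json.tar.gz
"""
import urllib.request

BASE = "https://dumps.wikimedia.org/other/enterprise_html/runs"
RUNS = ["20230701", "20230720", "20230801", "20230820",
        "20230901", "20230920", "20231001", "20231020", "20231101"]


def dump_size(wiki, run):
    """Return the Content-Length of a dump file in bytes, or None if unavailable."""
    url = f"{BASE}/{run}/{wiki}-NS0-{run}-ENTERPRISE-HTML.json.tar.gz"
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            return int(resp.headers["Content-Length"])
    except Exception:
        return None


if __name__ == "__main__":
    for run in RUNS:
        size = dump_size("enwiktionary", run)
        if size is None:
            print(f"{run}: missing")
        else:
            # Rough decimal GB, good enough to spot the size swings.
            print(f"{run}: {size / 1e9:.1f} GB")
```

Running this against the run dates above should reproduce the size swings without downloading a single archive.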

Event Timeline

Weirdly, there seems to be less variation in file sizes for Wikipedia dumps:

Charts of dump file sizes over time, by project:

wikipedia: wikipedia_sizes.png (427×556 px, 53 KB)
wiktionary: wiktionary_sizes.png (427×557 px, 67 KB)
wikisource: enwikisource.png (427×557 px, 70 KB)
wikivoyage: enwikivoyage.png (427×551 px, 68 KB)

Any idea why this would primarily affect non-Wikipedia projects? Is the code which generates these dumps available somewhere?

More suspicious file sizes:
19G /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230720/dewiki-NS0-20230720-ENTERPRISE-HTML.json.tar.gz
32G /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230920/dewiki-NS0-20230920-ENTERPRISE-HTML.json.tar.gz

The 2023-07-20 file seems to have been 700k or so rows short. The 2023-09-20 file seems to be truncated, I keep running into a malformed stream error while processing.
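For what it's worth, a quick way to check whether a downloaded archive is short or truncated is to stream it and count rows; a truncated gzip stream surfaces as an EOF/read error partway through. A rough sketch, assuming the tarball contains one or more NDJSON members as the .json.tar.gz naming suggests:

```python
import sys
import tarfile


def count_rows(path):
    """Count newline-delimited JSON rows across all files in a dump tarball.

    A truncated archive typically raises EOFError or tarfile.ReadError
    partway through iteration, which matches the "malformed stream" symptom.
    """
    rows = 0
    with tarfile.open(path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            f = tar.extractfile(member)
            if f is None:
                continue
            for _ in f:
                rows += 1
    return rows


if __name__ == "__main__":
    try:
        print(count_rows(sys.argv[1]))
    except (EOFError, tarfile.ReadError, OSError) as exc:
        print(f"archive appears truncated or corrupt: {exc}")
```

Comparing the resulting row count against the article count of the corresponding wiki (or an earlier, known-good run) would quantify how many rows each affected dump is missing.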

Some random guessing: perhaps the error-handling code is broken, and it just finishes the dump and closes the file without failing the process? But why then would so many wikis hit errors at the same time? All the 2023-07-20 dumps seem to be affected; maybe there was some site-wide network/server problem that wasn't handled properly?
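To make that guess concrete: if a mid-stream error from the upstream article iterator is swallowed, the writer still closes a valid (but short) gzip file and the run looks successful. This is a purely hypothetical sketch of that failure mode, not the actual Enterprise dump code:

```python
import gzip
import json


def write_dump(articles, path):
    """Hypothetical illustration of the suspected failure mode.

    The real dumps are .json.tar.gz archives; this simplified sketch writes a
    plain .json.gz, but the error-handling shape is what matters here.
    """
    with gzip.open(path, "wt", encoding="utf-8") as out:
        try:
            for article in articles:  # upstream fetch may fail mid-stream
                out.write(json.dumps(article) + "\n")
        except Exception:
            # Swallowing the error here produces a short but otherwise valid
            # archive, and the run is reported as successful. Re-raising (or
            # marking the run as failed) would surface the data loss instead.
            pass
```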

Is anything going to be done about this? The enterprise dumps have been in full failure mode for a few months now and are absolutely unusable. I really don't know how an obvious total failure of service can stay in triage hell for such a long time. I understand WMF resources are limited, but then at least let volunteers help out with this. My question above about the code generating the dumps is still unanswered. The transparency and communication on this whole issue have been miserable.

If there's no will to maintain usable dumps from the WMF side the community will have to build alternative systems.

JArguello-WMF changed the task status from Open to In Progress. Tue, Nov 7, 2:23 PM

Hello @jberkel! Thanks for your feedback. We understand the frustration that can arise from delayed responses, and please know that your concerns have not gone unnoticed. Our team is fully aware of the impact this delay has had, and we are committed to rectifying the situation as promptly as possible.

While we cannot guarantee an immediate resolution, I want to assure you that the matter is currently at the top of our agenda. We have marked it as an 'expedited' topic to be tackled with the utmost priority. We appreciate your understanding and patience as we work on the ticket.

Thank you for your continued interest in using these database dumps.

Hello,

The team continues to work on this issue. We have detected and addressed autoscaling problems, and we are continuing to dig into other potential causes as part of the root cause analysis.

We will post more updates as the investigation progresses.

Thank you