
{Investigation} Different file sizes for dumps
Closed, Resolved · Public · BUG REPORT

Description

It looks like we are missing data in our eowiki namespace 0 dumps, and we need to figure out the root cause. More information can be found here: https://meta.wikimedia.org/wiki/Talk:Wikimedia_Enterprise#Esperanto_(eowiki-NS0)_and_Aragonese_(anwiki-NS0)_Wikipedia_problem.
For context: our dumps are mirrored to https://dumps.wikimedia.org/ twice a month; they can be found here: https://dumps.wikimedia.org/other/enterprise_html/runs/.

Acceptance criteria

* Figure out the root cause
* Create a ticket for the solution (if the root cause was identified)
* Communicate the findings back to the Talk page

Developer Notes

  • Same issue showing up in enwiktionary:

file sizes from the most recent enwiktionary HTML dumps (NS0):

20230701: 13 GB
20230720: 7.1 GB
20230801: 1.1 GB
20230820: 4.6 GB
20230901: 7.2 GB
20230920: 3 GB
20231001: 5 GB
20231020: 2.9 GB
20231101: 3.0 GB
20231120: 3.2 GB
20231201: 3.5 GB
20231220: 3.8 GB
20240120: 9.6 GB
20240201: 9.6 GB
20240220: 9.6 GB
20240301: 9.6 GB
20240320: 10.0 GB

Something's going really wrong there.
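For anyone who wants to reproduce or keep monitoring these numbers, here is a minimal sketch (Python, using the `requests` package) that pulls the published sizes via HTTP HEAD requests. It assumes the file naming pattern visible in the paths quoted further down, `{wiki}-NS0-{run}-ENTERPRISE-HTML.json.tar.gz`, and uses a hard-coded run list purely for illustration:

```
# Sketch: report the published size of one wiki's NS0 dump across runs.
import requests

BASE = "https://dumps.wikimedia.org/other/enterprise_html/runs"
WIKI = "enwiktionary"
RUNS = ["20230701", "20230720", "20230801", "20240320"]  # extend as needed

for run in RUNS:
    url = f"{BASE}/{run}/{WIKI}-NS0-{run}-ENTERPRISE-HTML.json.tar.gz"
    resp = requests.head(url, allow_redirects=True, timeout=30)
    if resp.ok and "Content-Length" in resp.headers:
        size_gb = int(resp.headers["Content-Length"]) / 1e9
        print(f"{run}: {size_gb:.1f} GB")
    else:
        # older runs are eventually removed from the mirror
        print(f"{run}: not available (HTTP {resp.status_code})")
```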

Event Timeline

Weirdly, there seems to be less variation in file sizes for Wikipedia dumps:

wikipedia: wikipedia_sizes.png (427×556 px, 53 KB)
wiktionary: wiktionary_sizes.png (427×557 px, 67 KB)
wikisource: enwikisource.png (427×557 px, 70 KB)
wikivoyage: enwikivoyage.png (427×551 px, 68 KB)

Any idea why this would affect primarily non-wikipedia instances? Is the code which generates these dumps available somewhere?

More suspicious file sizes:
19G /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230720/dewiki-NS0-20230720-ENTERPRISE-HTML.json.tar.gz
32G /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230920/dewiki-NS0-20230920-ENTERPRISE-HTML.json.tar.gz

The 2023-07-20 file seems to have been 700k or so rows short. The 2023-09-20 file seems to be truncated, I keep running into a malformed stream error while processing.
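One way to check for truncation, sketched under the assumption that the dump is a gzipped tar of NDJSON files (the path below is just the file name from the listing above): stream through the whole archive and count rows, so a truncated or corrupt file fails loudly instead of merely looking short.

```
import json
import tarfile

PATH = "dewiki-NS0-20230920-ENTERPRISE-HTML.json.tar.gz"

rows = 0
try:
    # streaming mode ("r|gz") reads the gzip stream sequentially, so a
    # truncated archive raises an error instead of silently stopping early
    with tarfile.open(PATH, "r|gz") as tar:
        for member in tar:
            fh = tar.extractfile(member)
            if fh is None:
                continue
            for line in fh:
                json.loads(line)  # also flags malformed NDJSON lines
                rows += 1
except (tarfile.ReadError, EOFError, json.JSONDecodeError) as exc:
    print(f"stream broke after {rows} rows: {exc}")
else:
    print(f"archive read cleanly, {rows} rows")
```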

Some random guessing: perhaps the error handling code is borked, and it just finishes the dump and closes the file (without failing the process)? But why then would so many repositories hit errors at the same time? All the 07-20 dumps seem to be affected; maybe there were some site-wide network/server problems which weren't handled properly?
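To make that guess concrete, here is a toy sketch of the suspected failure mode. This is not the real dump generator (which doesn't appear to be public); it only illustrates how a swallowed exception can produce a valid-looking gzip stream that is simply short:

```
import gzip
import json


def write_dump_buggy(pages, out_path):
    # 'pages' stands in for whatever upstream stream feeds the dump
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        try:
            for page in pages:
                out.write(json.dumps(page) + "\n")
        except Exception:
            pass  # error swallowed: the gzip stream is still closed cleanly,
                  # the process exits 0, and the file just has fewer rows


def write_dump_strict(pages, out_path):
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for page in pages:
            out.write(json.dumps(page) + "\n")
        # no blanket except: an upstream failure propagates, so the run
        # fails loudly instead of publishing a silently truncated dump
```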

Is anything going to be done about this? The enterprise dumps have been in full failure mode for a few months now and are absolutely unusable. I really don't know how an obvious total failure of service can stay in triage hell for such a long time. I understand WMF resources are limited, but then at least let volunteers help out with this. My question above about the code that generates the dumps is still unanswered. The transparency/communication on this whole issue has been miserable.

If there's no will to maintain usable dumps from the WMF side the community will have to build alternative systems.

JArguello-WMF changed the task status from Open to In Progress. Nov 7 2023, 2:23 PM

Hello @jberkel! Thanks for your feedback. We understand the frustration that can arise from delayed responses, and please know that your concerns have not gone unnoticed. Our team is fully aware of the impact this delay has had, and we are committed to rectifying the situation as promptly as possible.

While we cannot guarantee an immediate resolution, I want to assure you that the matter is currently at the top of our agenda. We have marked it as an 'expedited' topic to be tackled with the utmost priority. We appreciate your understanding and patience as we work on the ticket.

Thank you for your continued interest in using these database dumps.

Hello,

The team continues to work on this issue. Following a root cause analysis, we have detected and addressed autoscaling issues, and we continue to dig deeper into other potential causes.

We will post more updates as we go along with the research.

Thank you

Hello,

We made a change in the last two weeks and are analysing the results to figure out whether there are fewer discrepancies; if you find any, please let us know.

We also continue to look into improvements of our snapshot process.

Thank you

@REsquito-WMF not sure if the changes were already in place, but the current enwiktionary NS0 dump is still at 3.5 GB (compared to 13 GB on 20230701).

@jberkel Happy new year.

We have returned to work and made a change to a configuration; it should be updated tomorrow.

Thank you.

Hi,

The change we made had a great impact:

Before: 1290827 total pages
Current: 5812947 total pages
Expected: 7921988 total pages

The missing pages are related to a bug we are tracking here: https://phabricator.wikimedia.org/T351712

We are going to track the rest of the work there.
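Rough arithmetic on those figures, just for context:

```
# coverage relative to the expected page count quoted above
before, current, expected = 1_290_827, 5_812_947, 7_921_988
print(f"before the change: {before / expected:.0%} of expected pages")  # ~16%
print(f"after the change:  {current / expected:.0%} of expected pages")  # ~73%
```

So the configuration change took the dump from roughly 16% to roughly 73% of the expected pages, with the remainder tracked in T351712.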

@REsquito-WMF thanks! So this means the next dumps will have more data, but will still be incomplete until this other bug is fixed?

OK. I think it might be worth putting a disclaimer somewhere, perhaps on https://dumps.wikimedia.org/other/enterprise_html/, to warn users that the dumps are incomplete.

The latest enwikt dump is now at 9.6 GB; still some way to go to the 13 GB of the 20230701 dump (also incomplete, but still useful as a baseline).

I'm wondering what the deal is with the "Closed as Unknown Status" here; I haven't seen this before and I'm unsure about its meaning.

Aklapper changed the task status from Unknown Status to Resolved. Mar 18 2024, 10:40 AM

So this has been resolved? Why was the 20230701 dump as large as 13 GB? Did it contain duplicate documents? Otherwise it is unclear why it is only 9.6 GB now.

It probably means the investigation has been "resolved". The main task is now T351712 + subtasks.

Can anyone clarify though? It seems that the new sub-tasks are now stuck again.