
{Investigation} Different file sizes for dumps
Closed, Resolved · Public · Bug Report

Description

Looks like we are missing data in our eowiki namespace 0 dumps; we need to figure out the root cause. More information can be found here: https://meta.wikimedia.org/wiki/Talk:Wikimedia_Enterprise#Esperanto_(eowiki-NS0)_and_Aragonese_(anwiki-NS0)_Wikipedia_problem.
For context: our dumps are mirrored to https://dumps.wikimedia.org/ twice a month; they can be found here: https://dumps.wikimedia.org/other/enterprise_html/runs/.

Acceptance criteria

  • Figure out the root cause
  • Create a ticket for the solution (if the root cause was identified)
  • Communicate the findings back to the Talk page

Developer Notes

  • Same issue showing up in enwiktionary:

file sizes from the most recent enwiktionary HTML dumps (NS0):

20230701: 13 GB
20230720: 7.1 GB
20230801: 1.1 GB
20230820: 4.6 GB
20230901: 7.2 GB
20230920: 3 GB
20231001: 5 GB
20231020: 2.9 GB
20231101: 3.0 GB
20231120: 3.2 GB
20231201: 3.5 GB
20231220: 3.8 GB
20240120: 9.6 GB
20240201: 9.6 GB
20240220: 9.6 GB
20240301: 9.6 GB
20240320: 10.0 GB
20240520: 10.7 GB
20240601: 14.5 GB 📈

Something's going really wrong there.
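
These sizes can be surveyed straight from the public mirror without downloading anything, e.g. via HEAD requests. A minimal sketch follows; the filename pattern follows the listings under https://dumps.wikimedia.org/other/enterprise_html/runs/, and the run dates below are only examples:

```
import urllib.error
import urllib.request

BASE = "https://dumps.wikimedia.org/other/enterprise_html/runs"
WIKI = "enwiktionary"                        # e.g. "eowiki", "anwiki", "dewiki"
RUNS = ["20240501", "20240520", "20240601"]  # example run dates, not a full list

for run in RUNS:
    url = f"{BASE}/{run}/{WIKI}-NS0-{run}-ENTERPRISE-HTML.json.tar.gz"
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            size_gb = int(resp.headers["Content-Length"]) / 1e9
            print(f"{run}: {size_gb:.1f} GB")
    except urllib.error.HTTPError as err:
        print(f"{run}: not available (HTTP {err.code})")
```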

Event Timeline

Weirdly, there seems to be less variation in file sizes for the Wikipedia dumps:

[Attached charts of dump file sizes over time:]
  • wikipedia: wikipedia_sizes.png
  • wiktionary: wiktionary_sizes.png
  • wikisource: enwikisource.png
  • wikivoyage: enwikivoyage.png

Any idea why this primarily affects non-Wikipedia projects? Is the code which generates these dumps available somewhere?

More suspicious file sizes:
19G /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230720/dewiki-NS0-20230720-ENTERPRISE-HTML.json.tar.gz
32G /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230920/dewiki-NS0-20230920-ENTERPRISE-HTML.json.tar.gz

The 2023-07-20 file seems to have been about 700k rows short. The 2023-09-20 file seems to be truncated; I keep running into a malformed stream error while processing it.
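
For reference, the check I run is roughly the following: stream the tarball, count NDJSON rows, and see whether the stream ends cleanly. A rough sketch, assuming the tarball contains newline-delimited JSON members; the path is just an example:

```
import tarfile

PATH = "dewiki-NS0-20230920-ENTERPRISE-HTML.json.tar.gz"  # example path

rows = 0
try:
    # Stream the archive ("r|gz") so we never need random access to a broken file.
    with tarfile.open(PATH, mode="r|gz") as tar:
        for member in tar:
            f = tar.extractfile(member)
            if f is None:  # skip directories / special members
                continue
            for _ in f:    # one JSON document per line
                rows += 1
except (EOFError, tarfile.ReadError) as err:
    # A truncated download typically surfaces here as a premature end of stream.
    print(f"stream ended abnormally after {rows} rows: {err}")
else:
    print(f"{rows} rows read cleanly")
```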

Some random guessing: perhaps the error-handling code is borked, and it just finishes the dump and closes the file (without failing the process)? But why then would so many repositories hit errors at the same time? All the 07-20 dumps seem to be affected; maybe there were some site-wide network/server problems which weren't handled properly?

Is there anything going to be done about this? The enterprise dumps have been in full failure mode for a few months now and are absolutely unusable. I really don't know how an obvious total failure of service can stay in triage hell for such a long time. I understand WMF resources are limited, but then at least let volunteers help out with this. My question about the code generating the dumps above is still unanswered. The transparency/communication on this whole issue has been miserable.

If there's no will to maintain usable dumps from the WMF side the community will have to build alternative systems.

JArguello-WMF changed the task status from Open to In Progress. Nov 7 2023, 2:23 PM

Hello @jberkel! Thanks for your feedback. We understand the frustration that can arise from delayed responses, and please know that your concerns have not gone unnoticed. Our team is fully aware of the impact this delay has had, and we are committed to rectifying the situation as promptly as possible.

While we cannot guarantee an immediate resolution, I want to assure you that the matter is currently at the top of our agenda. We have marked it as an 'expedited' topic to be tackled with the utmost priority. We appreciate your understanding and patience as we work on the ticket.

Thank you for your continued interest in using these database dumps.

Hello,

The team continues to work on this issue. We have detected and addressed autoscaling problems, and we continue to dig deeper into other potential causes as part of the root cause analysis.

We will post more updates as we go along with the research.

Thank you

Hello,

We made a change in the last two weeks and are analysing the results to determine whether there are fewer discrepancies; if you find any, please let us know.

We also continue to look into improvements of our snapshot process.

Thank you

@REsquito-WMF not sure if the changes were already in place, but the current enwiktionary NS0 dump is still at 3.5 GB (compared to 13 GB on 20230701).

@jberkel Happy new year.

We have returned to work and made a configuration change; it should take effect tomorrow.

Thank you.

Hi,

The change we made had a great impact:

Before: 1290827 total pages

Current: 5812947 total pages

Expected: 7921988 total pages

The missing pages are related to a bug we are tracking here: https://phabricator.wikimedia.org/T351712

We will track the rest of the work there.

@REsquito-WMF thanks! So this means the next dumps will have more data, but will still be incomplete until this other bug is fixed?

OK. I think it might be worth putting a disclaimer somewhere, perhaps on https://dumps.wikimedia.org/other/enterprise_html/, to warn users that the dumps are incomplete.

Latest enwikt dump is now at 9.6 GB, still some way to go to the 13 GB of the 20230701 dump (also incomplete, but still useful as a baseline).

I'm wondering what's the deal with the Closed as Unknown Status here, haven't seen this before and I'm unsure about its meaning.

Aklapper changed the task status from Unknown Status to Resolved. Mar 18 2024, 10:40 AM

So this has been resolved? Why was the 20230701 dump 13 GB, then? Did it contain duplicate documents? Otherwise it's unclear why it is only 9.6 GB now.

It probably means the investigation has been "resolved". The main task is now T351712 + subtasks.

Can anyone clarify though? It seems that the new sub-tasks are now stuck again.

This last change seems reasonable? Size is increasing now?

It's probably just the new content, with the baseline still being incomplete. I'll check with the XML dumps.

Latest HTML enwikt dump (20240520) vs XML dump:

  • 1,883,645 pages missing completely from the HTML dump
  • 4843 pages out of date (present in HTML dump but not matching the XML revision id)

Histogram of the time skew, in months:

[  0.0,   2.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2429
[  2.0,   4.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇ 954
[  4.0,   6.0) ┤▇▇▇ 224
[  6.0,   8.0) ┤▇▇ 131
[  8.0,  10.0) ┤▇▇▇ 224
[ 10.0,  12.0) ┤▇ 71
[ 12.0,  14.0) ┤ 7
[ 14.0,  16.0) ┤ 17
[ 16.0,  18.0) ┤▇ 86
[ 18.0,  20.0) ┤ 32
[ 20.0,  22.0) ┤▇ 46
[ 22.0,  24.0) ┤▇ 39

Basically this means almost 25% of the data is missing or outdated, and the outdated revisions often contain vandalism. I'm not sure why these dumps are even produced, it's a total waste of time, bandwidth, and CPU cycles.
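
For anyone who wants to reproduce the comparison: roughly, I build a title → latest revision id map from the XML stub dump, then stream the Enterprise NDJSON and bucket pages into missing / outdated / current. A minimal sketch is below; the Enterprise JSON field names used here (name, version.identifier) are from memory and may need adjusting to the actual schema, the XML namespace varies between dump versions, and the file names are just examples:

```
import gzip
import json
import tarfile
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # check the namespace of your dump

def xml_revisions(path):
    """Yield (title, latest rev id) for NS0 pages from a stub-meta-current dump."""
    with gzip.open(path) as f:
        for _, elem in ET.iterparse(f):
            if elem.tag != NS + "page":
                continue
            if elem.findtext(NS + "ns") == "0":
                title = elem.findtext(NS + "title")
                rev_id = int(elem.findtext(f"{NS}revision/{NS}id"))
                yield title, rev_id
            elem.clear()  # keep memory bounded

# Expected state according to the XML dump (redirect handling omitted here).
expected = dict(xml_revisions("enwiktionary-20240701-stub-meta-current.xml.gz"))

seen, outdated = set(), []
with tarfile.open("enwiktionary-NS0-20240701-ENTERPRISE-HTML.json.tar.gz", "r|gz") as tar:
    for member in tar:
        f = tar.extractfile(member)
        if f is None:
            continue
        for line in f:
            doc = json.loads(line)
            title, rev = doc["name"], doc["version"]["identifier"]
            seen.add(title)
            if title in expected and rev != expected[title]:
                outdated.append(title)  # comparing timestamps instead gives the skew histogram

missing = expected.keys() - seen
print(f"{len(missing)} pages missing, {len(outdated)} pages outdated")
```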

Hi All,

Quick update: we have finished the work.

The latest enwiktionary file is now 15 GB.

Wait for the next public dump, or get it from Enterprise.

We ran a few tests; duplicate, deleted, and missing pages are all below 1% at the time of testing.

Thanks

That's good news. I've done some tests, and it's looking much better now. The XML dumps haven't been released yet (due to T365501), so there's no baseline to do more detailed testing.

I've done some testing with the latest (20240701) dumps (allowing for some tolerance around the moment of dump generation):

  • 57 pages missing from the HTML dump
  • 176 pages out-of-date (present, but not matching the XML revision id)

That means an error rate of ~0.003% (233 of roughly 7.9 million pages), which seems acceptable :)

It might still be worth investigating why some pages are completely missing; I've added them for reference:

# | rev_id | title | timestamp | diff_dump (days before the 20240701 dump)
0 | 51803071 | forbyland | 2019-03-14 06:36:45 | 1936
1 | 75666470 | klampas | 2023-08-17 15:46:34 | 319
2 | 78439452 | 和製漢字 | 2024-03-12 05:44:29 | 111
3 | 79073720 | kösemenler | 2024-04-30 18:22:50 | 62
4 | 79087259 | निन्द् | 2024-05-02 18:54:53 | 60
5 | 79081549 | जनसंख़्या-नियंत्रण | 2024-05-02 02:19:42 | 60
6 | 79084663 | assiétes | 2024-05-02 11:24:24 | 60
7 | 79088128 | lüleciler | 2024-05-02 21:08:17 | 60
8 | 79090819 | مەڕمەڕ | 2024-05-03 12:20:55 | 59
9 | 79093002 | antitaurin | 2024-05-03 19:40:52 | 59
10 | 79097505 | călcǫn' | 2024-05-04 13:13:30 | 58
11 | 79097554 | течност за чистачки | 2024-05-04 13:25:00 | 58
12 | 79097580 | añda | 2024-05-04 13:33:06 | 58
13 | 79097576 | chemigroundwood | 2024-05-04 13:32:36 | 58
14 | 79097577 | chemigroundwoods | 2024-05-04 13:32:40 | 58
15 | 79097510 | búa và liềm | 2024-05-04 13:14:08 | 58
16 | 79097559 | 畫虎類犬 | 2024-05-04 13:26:48 | 58
17 | 79244586 | pleber | 2024-05-14 18:16:01 | 48
18 | 79291102 | imouto | 2024-05-17 17:09:06 | 45
19 | 79293554 | fundamentalize | 2024-05-18 00:44:29 | 44
20 | 79428088 | సేంది | 2024-05-26 06:54:59 | 36
21 | 79435962 | 凶巴巴 | 2024-05-27 11:41:12 | 35
22 | 79447028 | вусям | 2024-05-28 13:17:01 | 34
23 | 79487995 | ウオッカ | 2024-06-02 02:48:18 | 29
24 | 80016538 | heaþolac | 2024-06-02 22:18:54 | 29
25 | 79817322 | 口づける | 2024-06-02 16:21:00 | 29
26 | 80105696 | abrebiyado | 2024-06-03 04:38:22 | 28
27 | 80129282 | цаъᵸ | 2024-06-04 13:51:41 | 27
28 | 80129248 | наъцә | 2024-06-04 13:50:07 | 27
29 | 80129264 | раъкӏу | 2024-06-04 13:50:55 | 27
30 | 80129206 | маъкъу | 2024-06-04 13:47:34 | 27
31 | 80129103 | гьаъд | 2024-06-04 13:42:32 | 27
32 | 80129156 | йаъᵸлӏу | 2024-06-04 13:45:28 | 27
33 | 80158135 | see-you-next-Tuesday | 2024-06-07 03:56:30 | 24
34 | 80195798 | पियार | 2024-06-09 20:57:58 | 22
35 | 80205825 | 굴 소스 | 2024-06-11 04:00:34 | 20
36 | 80220040 | mulah | 2024-06-12 16:51:22 | 19
37 | 80258920 | heard of | 2024-06-16 13:56:55 | 15
38 | 80268142 | Hrofesceastre | 2024-06-17 18:35:37 | 14
39 | 80262671 | پس فردا | 2024-06-17 01:30:41 | 14
40 | 80268804 | ഞാമ | 2024-06-17 20:31:41 | 14
41 | 80268198 |  | 2024-06-17 18:41:26 | 14
42 | 80457314 | בעל־אַבֿדה | 2024-06-19 02:59:46 | 12
43 | 80465717 | handgags | 2024-06-20 13:24:17 | 11
44 | 80476402 | magka-diabetes | 2024-06-21 14:12:22 | 10
45 | 80476167 | mag-apply | 2024-06-21 13:04:06 | 10
46 | 80484440 | êcovar | 2024-06-22 11:19:17 | 9
47 | 80495348 | بیدۆک بابا | 2024-06-23 20:52:56 | 8
48 | 80495598 | vac. | 2024-06-23 21:26:21 | 8
49 | 80511181 | Austria-Unggriya | 2024-06-25 13:35:10 | 6
50 | 80514307 | süsu | 2024-06-25 22:57:57 | 6
51 | 80520575 | pabílu | 2024-06-26 22:42:53 | 5
52 | 80523268 | busted my neck | 2024-06-27 04:16:07 | 4
53 | 80522837 | અનાડી | 2024-06-27 02:51:56 | 4
54 | 80540117 | ¸ | 2024-06-28 21:38:35 | 3
55 | 80546421 | mehank | 2024-06-29 07:15:17 | 2
56 | 80548677 | Kalipornya | 2024-06-29 16:13:31 | 2

In some cases it looks like these pages had been redirects that were converted into content pages.
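
One way to spot-check that is to pull the last few revisions of a missing title from the regular MediaWiki API and look for #REDIRECT markers in the older wikitext. A small sketch; forbyland is the first title in the table above, and the User-Agent string is arbitrary:

```
import json
import urllib.parse
import urllib.request

API = "https://en.wiktionary.org/w/api.php"

def recent_revisions(title, limit=5):
    """Fetch the last few revisions (ids, timestamps, wikitext) of a page."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    })
    req = urllib.request.Request(f"{API}?{params}",
                                 headers={"User-Agent": "dump-consistency-check"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["query"]["pages"][0].get("revisions", [])

for rev in recent_revisions("forbyland"):
    text = rev["slots"]["main"]["content"]
    kind = "redirect" if text.lstrip().lower().startswith("#redirect") else "content"
    print(rev["revid"], rev["timestamp"], kind)
```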

Thanks @jberkel

Do you want to open a bug with this info?

If not, we'll open it and have a look.

Thanks