{Investigation} Different file sizes for dumps
Closed, ResolvedPublicBUG REPORT
Actions

Assigned To

Authored By

	REsquito-WMF
	Aug 29 2023, 3:02 PM

Description

Looks like we are missing data in our eowiki namespace 0 dumps, we need to figure out the root cause. More information can be found here: https://meta.wikimedia.org/wiki/Talk:Wikimedia_Enterprise#Esperanto_(eowiki-NS0)_and_Aragonese_(anwiki-NS0)_Wikipedia_problem.
For the context: our dumps are mirrored to https://dumps.wikimedia.org/ twice a month, they can be found here https://dumps.wikimedia.org/other/enterprise_html/runs/.

Acceptance criteria

*Figure out the root cause
*Create a ticket for the solution (if the root cause was identified)
*Communicate the findings back to the Talk page

Developer Notes

Same issue showing up in enwiktionary:

file sizes from the most recent enwiktionary HTML dumps (NS0):

20230701: 13 GB
20230720: 7.1 GB
20230801: 1.1 GB
20230820: 4.6 GB
20230901: 7.2 GB
20230920: 3 GB
20231001: 5 GB
20231020: 2.9 GB
20231101: 3.0 GB
20231120: 3.2 GB
20231201: 3.5 GB
20231220: 3.8 GB
20240120: 9.6 GB
20240201: 9.6 GB
20240220: 9.6 GB
20240301: 9.6 GB
20240320: 10.0 GB

something's going really wrong there.

Related Objects
Search...

Status	Subtype	Assigned	Task
Open	BUG REPORT	None	T305407 Stale data / missing pages in HTML ("enterprise")
Open		None	T353771 Calculate baseline reference metrics
Resolved	BUG REPORT	REsquito-WMF	T345176 {Investigation} Different file sizes for dumps
Open		None	T348100 Request: changelog for Enterprise API HTML dumps

Event Timeline

REsquito-WMF created this task.Aug 29 2023, 3:02 PM

REsquito-WMF moved this task from Incoming to To Be Estimated/To Be Discussed on the Wikimedia Enterprise board.Aug 29 2023, 3:05 PM

REsquito-WMF updated the task description. (Show Details)Aug 29 2023, 3:10 PM

ArielGlenn moved this task from Backlog to Other teams on the Dumps-Generation board.Aug 31 2023, 10:01 AM

Weirdly, there seems to be less variation in filesizes for Wikipedia dumps:

wikipedia

wiktionary

wikisource

wikivoyage

jberkel updated the task description. (Show Details)Sep 20 2023, 3:07 PM

Any idea why this would affect primarily non-wikipedia instances? Is the code which generates these dumps available somewhere?

awight added a subtask: T348100: Request: changelog for Enterprise API HTML dumps.Oct 4 2023, 6:30 AM

jberkel updated the task description. (Show Details)Oct 4 2023, 8:04 AM

More suspicious file sizes:
19G /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230720/dewiki-NS0-20230720-ENTERPRISE-HTML.json.tar.gz
32G /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230920/dewiki-NS0-20230920-ENTERPRISE-HTML.json.tar.gz

The 2023-07-20 file seems to have been 700k or so rows short. ~~The 2023-09-20 file seems to be truncated, I keep running into a malformed stream error while processing.~~

Tobi_WMDE_SW subscribed.Oct 5 2023, 11:09 AM

Some random guessing: perhaps the error handling code is borked, and it just finishes the dump and closes the file (without erroring the process)? But why then would so many repositories hit errors at the same time? All the 7-20 dumps seem to be affected, maybe some site-wide network/server problems which weren't handled properly?

jberkel updated the task description. (Show Details)Oct 20 2023, 3:00 PM

jberkel updated the task description. (Show Details)Nov 6 2023, 9:09 AM

Is there anything going to be done about this? The enterprise dumps have been in full failure mode for a few months now and are absolutely unusable. I really don't know how an obvious total failure of service can stay in triage hell for such a long time. I understand WMF resources are limited, but then at least let volunteers help out with this. My question about the code generating the dumps above is still unanswered. The transparency/communication on this whole issue has been miserable.

If there's no will to maintain usable dumps from the WMF side the community will have to build alternative systems.

REsquito-WMF claimed this task.Nov 7 2023, 6:18 AM

REsquito-WMF triaged this task as High priority.

REsquito-WMF moved this task from To Be Estimated/To Be Discussed to Sprint 50 on the Wikimedia Enterprise board.

REsquito-WMF edited projects, added Wikimedia Enterprise (Sprint 50); removed Wikimedia Enterprise.

JArguello-WMF changed the task status from Open to In Progress.Nov 7 2023, 2:23 PM

JArguello-WMF moved this task from Next Up to Incident/Unexpected work on the Wikimedia Enterprise (Sprint 50) board.

Hello @jberkel! Thanks for your feedback. We understand the frustration that can arise from delayed responses, and please know that your concerns have not gone unnoticed. Our team is fully aware of the impact this delay has had, and we are committed to rectifying the situation as promptly as possible.

While we cannot guarantee an immediate resolution, I want to assure you that the matter is currently at the top of our agenda. We have marked it as an 'expedited' topic to be tackled with the utmost priority. We appreciate your understanding and patience as we work on the ticket.

Thank you for your continued interest in using these database dumps.

JArguello-WMF edited projects, added Wikimedia Enterprise (Sprint 51); removed Wikimedia Enterprise (Sprint 50).Nov 16 2023, 2:33 PM

JArguello-WMF moved this task from Next Up to Incident/Unexpected work on the Wikimedia Enterprise (Sprint 51) board.

awight added a project: WMDE-TechWish-Maintenance-2023.Nov 21 2023, 9:25 AM

awight moved this task from Backlog to Watching / Epic on the WMDE-TechWish-Maintenance-2023 board.

Hello,

The team continues to work on this issue, we have detected autoscaling issues that we have addressed and continue to dig deeper into other potential causes, after a root cause analysis.

We will post more updates as we go along with the research.

Thank you

JArguello-WMF edited projects, added Wikimedia Enterprise (Sprint 52); removed Wikimedia Enterprise (Sprint 51).Nov 30 2023, 3:27 PM

JArguello-WMF moved this task from Next Up to Incident/Unexpected work on the Wikimedia Enterprise (Sprint 52) board.

Hello,

We have made a change in the last 2 weeks and are analysing the results in order to figure out if there's less discrepancies, if you find any please let us know.

We also continue to look into improvements of our snapshot process.

Thank you

jberkel updated the task description. (Show Details)Dec 11 2023, 8:13 PM

@REsquito-WMF not sure if the changes were already in place, but the current enwiktionary NS0 dump is still at 3.5 GB (compared to 13 GB on 20230701).

awight mentioned this in T353468: Add an errata to published "v1" scraper data.Dec 18 2023, 10:35 AM

awight added a parent task: T353771: Calculate baseline reference metrics.Dec 20 2023, 9:40 AM

jberkel updated the task description. (Show Details)Dec 31 2023, 10:41 AM

JArguello-WMF edited projects, added Wikimedia Enterprise (sprint 53); removed Wikimedia Enterprise (Sprint 52), WMDE-TechWish-Maintenance-2023.Jan 4 2024, 5:01 PM

JArguello-WMF moved this task from Next Up to Incident/Unexpected work on the Wikimedia Enterprise (sprint 53) board.

awight mentioned this in T354369: Scraper: sanity check dump size by including the sitestats article count.Jan 4 2024, 5:29 PM

@jberkel Happy new year.

We returned to work and we made a change in a configuration, it should be updated tomorrow.

Thank you.

Hi,

The change we made had a great impact:

Before: 1290827 total pages

Current: 5812947 total pages

Expected: 7921988 total pages

the missing pages are related to a bug we are tracking here: https://phabricator.wikimedia.org/T351712

In that sense, we are going to be tracking the rest of the work there

REsquito-WMF moved this task from Incident/Unexpected work to Done on the Wikimedia Enterprise (sprint 53) board.Jan 9 2024, 4:24 PM

awight updated the task description. (Show Details)Jan 9 2024, 5:01 PM

@REsquito-WMF thanks! So this means the next dumps will have more data, but will still be incomplete until this other bug is fixed?

Correct @jberkel

OK. I think it might be worth putting a disclaimer somewhere, perhaps on https://dumps.wikimedia.org/other/enterprise_html/, to warn users that the dumps are incomplete.

JArguello-WMF closed this task as Unknown Status.Jan 18 2024, 1:11 PM

jberkel updated the task description. (Show Details)Jan 26 2024, 12:08 PM

Latest enwikt dump is now at 9.6 GB, still some way to go to the 13GB of the 20230701 dump (also incomplete, but still useful as a baseline).

I'm wondering what's the deal with the Closed as Unknown Status here, haven't seen this before and I'm unsure about its meaning.

awight mentioned this in T125803: Closed task with "Unknown status".Jan 30 2024, 11:49 AM

Veikk0.ma subscribed.Feb 1 2024, 7:14 PM

jberkel updated the task description. (Show Details)Feb 2 2024, 11:49 AM

jberkel updated the task description. (Show Details)Feb 21 2024, 8:51 PM

jberkel updated the task description. (Show Details)Mar 1 2024, 1:47 PM

Aklapper changed the task status from Unknown Status to Resolved.Mar 18 2024, 10:40 AM

So this has been resolved? 13GB of the 20230701 dump was so large why? Because it contained duplicate documents? Otherwise it is unclear why it is just 9.6 GB now.

It probably means the investigation has been "resolved". The main task is now T351712 + subtasks.

Can anyone clarify though? It seems that the new sub-tasks are now stuck again.

jberkel updated the task description. (Show Details)Mar 25 2024, 11:02 AM

	F37727916: enwikivoyage.png
	Sep 15 2023, 8:47 PM

	F37727918: enwikisource.png
	Sep 15 2023, 8:47 PM

	F37727920: wiktionary_sizes.png
	Sep 15 2023, 8:47 PM

	F37727922: wikipedia_sizes.png
	Sep 15 2023, 8:47 PM

{Investigation} Different file sizes for dumpsClosed, ResolvedPublicBUG REPORTActions

Description

Related ObjectsSearch...

Event Timeline

{Investigation} Different file sizes for dumps
Closed, ResolvedPublicBUG REPORT
Actions

Related Objects
Search...