
Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs
Closed, Resolved · Public · 8 Estimated Story Points · BUG REPORT

Event Timeline

The 2023-04-20 run was not completed because of a token refresh failure we experienced (see https://phabricator.wikimedia.org/T335368). The current run is still in progress, and we can't say for now whether the files are missing; we'll have to wait for the dumps to be completed.

Thanks! Is there any way to check the HTML dump progress/state "from the outside"? The XML dumps have a status page plus the machine-readable dumpstatus.json.
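
For reference, that machine-readable status can be read with a few lines of Python; a minimal sketch, assuming the usual dumps.wikimedia.org layout (the wiki and run date below are placeholder examples):

import json
import urllib.request

# Placeholder wiki and run date; adjust as needed.
url = "https://dumps.wikimedia.org/enwiki/20230501/dumpstatus.json"
with urllib.request.urlopen(url) as resp:
    status = json.load(resp)

# Each dump job reports its own status ("done", "in-progress", ...).
for job, info in status.get("jobs", {}).items():
    print(job, info.get("status"))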

Change 914800 had a related patch set uploaded (by Hokwelum; author: Hokwelum):

[operations/puppet@production] Increase number of retries for html download

https://gerrit.wikimedia.org/r/914800

Change 914800 merged by ArielGlenn:

[operations/puppet@production] Increase number of retries for html dumps download

https://gerrit.wikimedia.org/r/914800

Currently, there isn't any way for you to track the progress of the dumps because we don't produce them locally; we fetch them via an API from WME. At the time of the run, WME gives us access to a list of wikis and namespaces; this information is retrieved on the fly rather than from a static list, and it could change on the next run.
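
A rough, purely illustrative sketch of that flow; fetch_wiki_namespace_list() and download_namespace_dump() are hypothetical placeholders, not actual WME endpoints or downloader functions:

# Illustrative only: the helpers below are hypothetical placeholders.
def run_enterprise_download():
    # The wiki/namespace list comes from WME at run time rather than from a
    # static list, so it can differ from one run to the next.
    for wiki, namespace in fetch_wiki_namespace_list():
        download_namespace_dump(wiki, namespace)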

Ok, so the files have been generated, but not copied? Can they be recovered?

Ok, so the files have been generated, but not copied? Can they be recovered?

It's not so much a question of recovering the files as of downloading them via the Wikimedia Enterprise API. A backfill job was started yesterday to grab the latest versions of the missing files, and it may take a day or two to complete. Check the directory once in a while and see.

Yes that's what I meant, thanks 🤞

The files haven't materialized, guess something is still amiss…

awight renamed this task from Missing Wiktionary Enterprise Dumps in 2023-04-20 and 2023-05-01 runs to Missing Enterprise Dumps in 2023-04-20 and 2023-05-01 runs. May 10 2023, 11:31 AM
awight updated the task description.

The files haven't materialized, guess something is still amiss…

The backfill stopped after permission errors for rowiki and kowiki for ns14; no idea why those are listed but forbidden. (The exact error was a 401, i.e. Unauthorized.)
I've restarted it and will be watching.

The files we are trying to backfill have md5sums that do not match the value that the dump info contains. Example:

dump info for dewiki, ns 6: b'{"identifier":"dewiki","version":"d8c66bcef6b0f5a4d68ccc8e3b4bf71c","date_modified":"2023-05-10T06:12:25.316582542Z","in_language":{"identifier":"de"},"size":{"value":196.179e0,"unit_text":"MB"}}

but the actual file has:

dumpsgen@clouddumps1001:/srv/dumps/xmldatadumps/public/other/enterprise_html/runs/20230501$ md5sum dewiki-NS6-20230501-ENTERPRISE-HTML.json.tar.gz.tmp
bc24cacd639a4d2a6e2f26f8cbc1b85b  dewiki-NS6-20230501-ENTERPRISE-HTML.json.tar.gz.tmp

This is from a run conducted just minutes ago.

I don't know what might be causing the discrepancy. Pinging @Protsack.stephan for ideas.

Sorry, somehow I missed this tag; I'm going to take a look into the logs to see if there's something interesting there.

I have a quick question: how is the checksum being fetched? I mean, before or after the download was initiated?

...

I have a quick question: how is the checksum being fetched? I mean, before or after the download was initiated?

We retrieve it just before the download, do the download, and if they don't match we retry getting both. Note that in this case I manually ran the retrieval of the md5sum and the download several times in succession, and the download was quite fast. So it's not that, for example, some later dump file had been generated; the script is designed to deal with that.

I should also note that the downloaded file had the same md5sum each time it was computed at the command line. And as far as I could tell, the file was a complete tar.gz file, not truncated in any way.
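
A minimal sketch of that compare-and-retry logic; get_claimed_md5sum() and download_file() are hypothetical placeholders rather than the actual script's functions, and the retry count is an assumption:

import hashlib

def fetch_with_verification(wiki, namespace, retries=3):
    for _ in range(retries):
        # The claimed checksum is retrieved just before each download attempt.
        claimed = get_claimed_md5sum(wiki, namespace)
        path = download_file(wiki, namespace)
        # Hash the downloaded file in chunks, as the command-line md5sum would.
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        if md5.hexdigest() == claimed:
            return path
        # Mismatch: retry, fetching both the checksum and the file again.
    raise RuntimeError(f"checksum mismatch for {wiki} ns{namespace} after {retries} attempts")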

Change 919859 had a related patch set uploaded (by Hokwelum; author: Hokwelum):

[operations/puppet@production] Modify runtime of html dumps rsync to secondary host

https://gerrit.wikimedia.org/r/919859

Another question: where are the Enterprise dumps stored on Toolforge now? They seem to have stopped updating in October last year.

$ ls /public/dumps/public/other/enterprise_html/runs/
20220720  20220801  20220820  20220901	20220920  20221001

Another question: where are the Enterprise dumps stored on Toolforge now? They seem to have stopped updating in October last year.

$ ls /public/dumps/public/other/enterprise_html/runs/
20220720  20220801  20220820  20220901	20220920  20221001

The rsync job meant to update the dumps after files have been downloaded on the primary host has not been running since last year. It was recently fixed and we expect the data on clouddumps1002 to be updated on the next run.

Another question: where are the Enterprise dumps stored on Toolforge now? They seem to have stopped updating in October last year.

$ ls /public/dumps/public/other/enterprise_html/runs/
20220720  20220801  20220820  20220901	20220920  20221001

The rsync job meant to update the dumps after files have been downloaded on the primary host has not been running since last year. It was recently fixed and we expect the data on clouddumps1002 to be updated on the next run.

Alright, maybe the stars will align and we'll see some dumps next week :)

Change 919859 merged by ArielGlenn:

[operations/puppet@production] Modify runtime of html dumps rsync to secondary host

https://gerrit.wikimedia.org/r/919859

Sorry, somehow I missed this tag; I'm going to take a look into the logs to see if there's something interesting there.

I have a quick question: how is the checksum being fetched? I mean, before or after the download was initiated?

Hello @Protsack.stephan, did you have time to look into this? In a few days the next download will begin, and it would be helpful to know what to expect :-)

I'm sorry, I didn't have a chance to take a proper look into that yet.
I've just double-checked the checksums in the API and in the underlying storage and have not found any issues there (at the moment, of course; the problem is that we don't keep history).

I'll try to dive deeper before the next run, but if I don't manage to find anything before it starts, let's monitor what happens.
We were doing a deployment and fixing a bug right about the time the download was happening, so that might have been the cause, but I can't be sure at the moment.

The main reason I can see for why it happened is that the metadata request provided the wrong checksum information. That might have happened due to failed jobs on our side that published new snapshots but did not update the metadata after publishing. I don't see any hiccups or failed jobs on our side at the moment, so it looks like it should be OK (at least on paper).

dewiki is entirely missing from https://dumps.wikimedia.org/other/enterprise_html/runs/20230520/ , which is a bit of a new failure mode AFAICT? Previously there would be some but not all namespaces present.

jberkel renamed this task from Missing Enterprise Dumps in 2023-04-20 and 2023-05-01 runs to Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs. May 22 2023, 7:06 AM
jberkel updated the task description.

dewiki is entirely missing from https://dumps.wikimedia.org/other/enterprise_html/runs/20230520/ , which is a bit of a new failure mode AFAICT? Previously there would be some but not all namespaces present.

Note that the downloader is still running.

Thanks for the note--I also see the new schedule is --2,4,6,21,23,25 8:30:0, so I'll wait until Friday to draw conclusions about this run.

...

The main reason I can see for why it happened is that the metadata request provided the wrong checksum information. That might have happened due to failed jobs on our side that published new snapshots but did not update the metadata after publishing. I don't see any hiccups or failed jobs on our side at the moment, so it looks like it should be OK (at least on paper).

Running into similar issues this month. Is there caching we're not aware of, perhaps? At any rate, here's an example:
vecwiki, ns0: the claimed md5sum is 534e3cf22ad45e83dcd22b700f4e04a9 and the actual is 5ebd3b97283d859741bfaae7cc031c11, tested a few times in a row with the same values each time.
URL for getting the claimed md5sum: https://api.enterprise.wikimedia.com/v1/exports/meta/0/vecwiki
URL for getting the ns0 content: https://api.enterprise.wikimedia.com/v1/exports/download/0/vecwiki
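
For anyone wanting to reproduce this, a minimal sketch using those two URLs; the bearer token is a placeholder, and which field of the meta response carries the claimed checksum is an assumption (the "version" field looks like an md5 hex digest in the responses above):

import hashlib
import requests

# Placeholder credentials: a real Wikimedia Enterprise access token is required.
HEADERS = {"Authorization": "Bearer <access token>"}

meta = requests.get(
    "https://api.enterprise.wikimedia.com/v1/exports/meta/0/vecwiki",
    headers=HEADERS, timeout=60,
).json()
claimed = meta.get("version")  # assumption: this field holds the claimed md5sum

# Stream the dump and hash it on the fly rather than keeping it in memory.
md5 = hashlib.md5()
with requests.get(
    "https://api.enterprise.wikimedia.com/v1/exports/download/0/vecwiki",
    headers=HEADERS, stream=True, timeout=60,
) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1 << 20):
        md5.update(chunk)

print("claimed:", claimed)
print("actual: ", md5.hexdigest())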

@ArielGlenn Is the downloaded data usable, that is, can you decompress the files without error? If the files are OK, maybe it's a problem with the checksum generation: if the checksums are off only for some files, it could be related to the file size. Perhaps some sort of overflow where the hashes are calculated?

Just throwing some ideas out there…

@ArielGlenn Is the downloaded data usable, that is, can you decompress the files without error? If the files are OK, maybe it's a problem with the checksum generation: if the checksums are off only for some files, it could be related to the file size. Perhaps some sort of overflow where the hashes are calculated?

Just throwing some ideas out there…

I tested on a relatively small file, the same file several times, just to see. The file itself appears to be intact: it uncompresses fine, and tar extraction runs fine as well. Additionally, the md5sum run from the command line via the Linux md5sum utility yields the same result as the one computed by the script. This leads me to believe that there is a problem with the md5sum the API claims to have for the file.
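
For completeness, the same sanity check can be scripted; a small sketch (the filename is the one from the earlier dewiki example, and nothing here is specific to the downloader script):

import hashlib
import tarfile

path = "dewiki-NS6-20230501-ENTERPRISE-HTML.json.tar.gz.tmp"

# Recompute the md5sum the same way the command-line utility does.
md5 = hashlib.md5()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print("md5:", md5.hexdigest())

# A truncated archive would raise an error while listing its members.
with tarfile.open(path, mode="r:gz") as tar:
    print("members:", sum(1 for _ in tar))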

@Protsack.stephan Where are the checksums calculated? Can you re-index the metadata of the dump files on the API side so that it matches the actual file content? It looks like the checksums might be calculated before the file is fully processed, or from a different version of the file (as you indicated in your comment).

@ArielGlenn If the API side isn't fixed by the June run, would it be possible to ignore the checksums and copy the files regardless? We've been dump-less for 2 months now…

@ArielGlenn If the API side isn't fixed by the June run, would it be possible to ignore the checksums and copy the files regardless? We've been dump-less for 2 months now…

The way we know that our downloads are good is via the md5sums; without them, we have no guarantee that the files are intact, just as downloaders from dumps.wikimedia.org check the md5sums after they download these same files. I appreciate your position, but we really need to get the issue sorted, whatever it may be.

Sorry for being slow on replies; we are looking into this at the moment and will get back with updates this week.

@Protsack.stephan Where are the checksums calculated? Can you re-index the metadata of the dump files on the API side so that it matches the actual file content? It looks like the checksums might be calculated before the file is fully processed, or from a different version of the file (as you indicated in your comment).

We are relying on the object store to calculate the checksum and then we just serve it from the API.
I'm trying to figure out whether this has anything to do with us using multipart uploads.
It might be the case that we are just serving the checksum of the previous dump.
Meaning: we are grabbing the checksum before the upload has finished.

It might be the case that we are just serving the checksum of the previous dump.
Meaning: we are grabbing the checksum before the upload has finished.

Thanks for looking into this! Are you directly using Amazon's MD5 values by any chance? It looks like they don't work as expected for multipart uploads:

When you upload an object to Amazon S3, you can specify a checksum algorithm for Amazon S3 to use. Amazon S3 uses MD5 by default to verify data integrity; however, you can specify an additional checksum algorithm to use. When using MD5, Amazon S3 calculates the checksum of the entire multipart object after the upload is complete. This checksum is not a checksum of the entire object, but rather a checksum of the checksums for each individual part

https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
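
To make the difference concrete, here is a small sketch contrasting the plain md5 of a file with the multipart-style "checksum of checksums" described above (the 8 MB part size is just an assumed example):

import hashlib

def plain_md5(path):
    # md5 of the whole file, as the md5sum utility computes it.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest()

def multipart_style_etag(path, part_size=8 * 1024 * 1024):
    # S3-style multipart ETag: md5 of the concatenated per-part md5 digests,
    # suffixed with the number of parts. Not the md5 of the whole file.
    part_digests = []
    with open(path, "rb") as f:
        for part in iter(lambda: f.read(part_size), b""):
            part_digests.append(hashlib.md5(part).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"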

Always welcome!
Thanks for sharing this; I've come to the same conclusion.
Looking into potential fixes and trying to figure out the best way to handle this.

Looking into potential fixes and trying to figure out the best way to handle this.

It looks like you can just copy a multipart-uploaded object (remote-to-remote) and it'll recalculate the checksum for you:

With a copy command, the checksum of the object is a direct checksum of the full object. If the object was originally uploaded using a multipart upload, then the checksum value changes even though the data has not.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html
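
A hedged sketch of that remote-to-remote copy with boto3; the bucket and key names are placeholders, a same-key copy is only accepted when something changes (hence the metadata directive), and objects over 5 GB would need a multipart copy instead of copy_object:

import boto3

s3 = boto3.client("s3")

# Placeholders: the real bucket and key names are not known from this thread.
bucket = "example-dumps-bucket"
key = "dewiki-NS6-ENTERPRISE-HTML.json.tar.gz"

# Copying the object onto itself rewrites it as a single-part object, so the
# stored checksum becomes a plain md5 of the full content.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    MetadataDirective="REPLACE",
)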

@jberkel thanks for the info. I think I've fixed the problem. I've done some testing and it seems to be working.

@ArielGlenn Can we do some kind of test run before the next production run starts to confirm that everything's ok?

I can check a specific wiki, as I did above, and see if it works for that one. Will that be sufficient for your purposes? (Are you not able to test on your end though?)

That would be great. I was able to test and it looks fine on my end, but I would really like to do a small test on the other side of things just to be sure.

I have checked the same small wiki and namespace as earlier (vecwiki, ns0) and it now checks out.

Thanks @ArielGlenn, everything checks out on my end as well; I've tested multiple small and large projects.

We'll find out tomorrow! Keeping fingers crossed.

Looks like the data was copied successfully this time! I've downloaded the enwiktionary-NS0 dump and the checksum matches.

There's still nothing on toolforge, though:

$ ls -l /public/dumps/public/other/enterprise_html/runs/
total 76
drwxr-xr-x 2  400  400 57344 May 17 15:29 20230220
drwxr-xr-x 2  400  400  4096 Mar  2 00:06 20230301
drwx------ 2 root root  4096 May 17 15:11 20230320
drwx------ 2 root root  4096 May 17 15:11 20230401
drwx------ 2 root root  4096 May 17 15:11 20230420
drwx------ 2 root root  4096 May 17 15:11 20230501

Additionally, some of the directories aren't accessible.

Looks like the data was copied successfully this time! I've downloaded the enwiktionary-NS0 dump and the checksum matches.

There's still nothing on toolforge, though:

$ ls -l /public/dumps/public/other/enterprise_html/runs/
total 76
drwxr-xr-x 2  400  400 57344 May 17 15:29 20230220
drwxr-xr-x 2  400  400  4096 Mar  2 00:06 20230301
drwx------ 2 root root  4096 May 17 15:11 20230320
drwx------ 2 root root  4096 May 17 15:11 20230401
drwx------ 2 root root  4096 May 17 15:11 20230420
drwx------ 2 root root  4096 May 17 15:11 20230501

Additionally, some of the directories aren't accessible.

The rsync, which copies the files over to the NFS share accessible to Toolforge, is still in progress.

The rsync, which copies the files over to the NFS share accessible to Toolforge, is still in progress.

still in progress?

still in progress?

Yes please :-) The rsync is still in progress!

still in progress?

Yes please :-) The rsync is still in progress!

wow, ok.

Looks like the files have finally been synced to Toolforge!

I think this is resolved; the current run is already available on the public web server and on the NFS share for WMCS instances, with the same number of files as the download on the 1st of the month. Closing.