Incomplete files uploaded - chunked upload drops last chunk.
Closed, Resolved · Public · BUG REPORT

Description

Current most likely theory: AssembleUploadChunksJob depends on the value of upload_stash.us_chunk_inx. The job may start running before the API request's transaction has finished committing, so the value it reads is off by one and the last chunk is dropped.
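A minimal, self-contained sketch of that sequence (FakeDb, its methods, and the chunk numbers are purely illustrative stand-ins, not MediaWiki code): the web request writes the final chunk index inside a still-open transaction and the job is enqueued immediately, so the job's own DB connection only sees the previously committed value.

<?php
// Illustrative simulation of the suspected race; FakeDb is a hypothetical
// stand-in for the real upload_stash table and DB layer.

class FakeDb {
    private array $committed = [ 'us_chunk_inx' => 2 ]; // chunks 0..2 already committed
    private array $pending = [];

    public function beginWrite( int $newIndex ): void {
        // Write inside a transaction that has not committed yet.
        $this->pending = [ 'us_chunk_inx' => $newIndex ];
    }

    public function commit(): void {
        $this->committed = $this->pending;
        $this->pending = [];
    }

    public function readCommitted(): int {
        // What another connection (the job runner) sees.
        return $this->committed['us_chunk_inx'];
    }
}

$db = new FakeDb();

// API request: record the final chunk (index 3) but do not commit yet.
$db->beginWrite( 3 );

// Job pushed immediately: its connection only sees committed data.
$seenByJob = $db->readCommitted(); // 2, not 3

// The API transaction commits too late for the job.
$db->commit();

echo "Job assembled chunks 0..$seenByJob; chunk 3 was dropped\n";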


Looks like T190988 and T266903 have a new edition, this time producing errors at 10 MB and leaving a grey area behind.

Examples: https://commons.wikimedia.org/w/index.php?sort=create_timestamp_desc&search=filesize%3A10239%2C10241

https://commons.wikimedia.org/wiki/File:%D0%92%D0%B8%D0%B4_%D0%BD%D0%B0_%D0%BE%D0%BA%D1%80%D0%B5%D1%81%D1%82%D0%BD%D0%BE%D1%81%D1%82%D0%B8_%D0%95%D0%97%D0%A1%D0%9A_%D0%B8%D0%B7_%D0%B3%D0%B0%D1%80%D0%B0%D0%B6%D0%BD%D0%BE%D0%B3%D0%BE_%D0%BA%D0%BE%D0%BE%D0%BF%D0%B5%D1%80%D0%B0%D1%82%D0%B8%D0%B2%D0%B01.jpg
https://commons.wikimedia.org/wiki/File:Himalayan_Tibet_Museum_-_Darjeeling_-_Exhibits_(4).jpg
https://commons.wikimedia.org/wiki/File:Lille_6,_P%C3%A4rnu_2.jpg
https://commons.wikimedia.org/wiki/File:Pengantin_pria_melayu_sedang_duduk_tampak_depan_4.jpg
https://commons.wikimedia.org/wiki/File:Festung_W%C3%BClzburg_2023_-_028.jpg
https://commons.wikimedia.org/wiki/File:WIkiQuiz_2023._5-bosqich_18.jpg

Can you please fix this ASAP so you don't create another backlog like https://commons.wikimedia.org/wiki/Category:Incomplete_JPG_files_(5_MB_interruption)?
Also, please identify all problematic files and create a tracking category like the old one, probably titled "Incomplete_JPG_files_(10_MB_interruption)".

Event Timeline

It seems the bug has been lurking for years but has gone unnoticed.

https://commons.wikimedia.org/wiki/File:Zapis-0592-Riljac-Glisica-hrast_20160402_2738.jpg uploaded on 18 November 2016 is probably the earliest surviving file with this problem.

Aklapper removed a subscriber: tstarling.

+SRE-swift-storage as I doubt that the Commons community can solve this on their own.

Perhaps worth noting that some of these images predate the move to envoy too (e.g. https://commons.wikimedia.org/wiki/File:Youngtimer_Trophy_Schwedenkreutz_2023_15.jpg ) so I don't think this is related to the recent nginx->envoy swift changes. I'm tempted to suggest a client issue?

Picking a recent failure:

mvernon@cumin1001:~$ sudo cumin -x --force --no-progress --no-color -o txt O:swift::proxy "zgrep -F '0/0e/Wikidata_43.jpg' /var/log/swift/proxy-access.log.3.gz" >~/junk/T350917.txt
#Cumin output elided
mvernon@cumin1001:~$ grep " PUT " junk/T350917.txt | grep 'public'
moss-fe2001.codfw.wmnet: Nov 24 19:12:12 moss-fe2001 proxy-server: 10.192.48.101 10.192.32.51 24/Nov/2023/19/12/12 PUT /v1/AUTH_mw/wikipedia-commons-local-public.0e/0/0e/Wikidata_43.jpg HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk77cb529d3... 10485760 - 574193492001fdf36bdf02cbd36887a4 tx56ca15da11c2455281325-006560f58b - 0.2022 - - 1700853131.809281588 1700853132.011436224 0
ms-fe1013.eqiad.wmnet: Nov 24 19:12:12 ms-fe1013 proxy-server: 10.192.48.101 10.64.48.149 24/Nov/2023/19/12/12 PUT /v1/AUTH_mw/wikipedia-commons-local-public.0e/0/0e/Wikidata_43.jpg HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk3f3803a6c... 10485760 - 574193492001fdf36bdf02cbd36887a4 tx460fd22327174e6ba0df8-006560f58c - 0.3835 - - 1700853132.315957069 1700853132.699444294 0

So you can see there are two successful PUTs (one to each ms cluster), both from mw2279.codfw.wmnet, of the same 10485760-byte object.

So from the Swift point of view, it was offered a 10 MB object and stored it as requested, so I think the problem lies further up the stack (UploadWizard was used; maybe it was confused by a client interruption or some such?)

They don't always fail; the failures are sporadic.
An affected upload can actually be fixed by uploading the picture again, so it's not reproducible.
It's not as if uploading a certain picture always produces the 5 MB or 10 MB error.

Just a wild guess: could it be that a picture is uploaded in 5 MB chunks, it somehow gets cut off at a chunk boundary, but the receiving end thinks it finished?
Notably, it seems this bug only affects images, not PDFs or videos, and it mostly affects JPGs.
So it must be related to the differences between the ways JPG and PDF files are uploaded and checked?

The most likely answer is that the error occurred at the chunked upload stage (either during assembly or during the individual chunk steps). The corrupted upload then got combined into a stashed upload, and that stashed upload then got renamed to Wikidata_43.jpg. So the relevant logs wouldn't be for Wikidata_43.jpg but for whatever the filekey was for that upload. It's also possible this is some sort of front-end logic bug.

Anyway, I recently added a bunch of additional logging on the MediaWiki side, so if this happens again, let me know; there may be more logging we can look at.

If we want this to be more reliable, we could consider having the front-end send an SHA-1 hash and then checking on the backend that it matches.
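A minimal sketch of that backend check, assuming hypothetical names ($clientSha1, $assembledPath and verifyAssembledUpload are illustrative, not the real API parameters or MediaWiki code):

<?php
// Hypothetical verification step: reject an assembled file whose SHA-1 does
// not match the hash the client claimed for the whole upload.

function verifyAssembledUpload( string $assembledPath, string $clientSha1 ): void {
    $actual = sha1_file( $assembledPath );
    if ( $actual === false ) {
        throw new RuntimeException( "Could not hash assembled file: $assembledPath" );
    }
    if ( !hash_equals( strtolower( $clientSha1 ), $actual ) ) {
        throw new RuntimeException(
            "Assembled upload hash mismatch: client sent $clientSha1, got $actual"
        );
    }
}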

A more recent example is File:Delft_Van_Miereveltlaan_7.jpg (aka 1aqdj7jclxmc.afcsnk.1553787.jpg aka 1aqdj6vvp0a8.a3pwzj.1553787.jpg), which is cut off at 10 MB.

One weird thing about this file: UploadWizard seems to have sent the filename "0217van miereveltlaan 7.jpg" for it at earlier steps. The filenames from the earlier steps are ignored and only the last one is taken into account, but it is a tad weird.

During upload, the correct file size was sent (16073728), and 3 chunks were sent. It looks like the last chunk was dropped.

My suspicion here is that this is some sort of DB latency situation where the MW job does not see the last chunk (UploadWizard uses 5 MB chunks, but they can be any size). The chunk index comes from upload_stash.us_chunk_inx, which is fetched from the primary DB. However, it is plausible that the job could start running before the DB transaction that modified the table completes. The job would then get the view of the table as it looked at the previous chunk, and thus miss the most recently uploaded chunk.

MW knows the file size the file should be, so there should probably be some error detection here.
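A sketch of what such a check might look like, assuming a hypothetical helper (checkAssembledSize is not the real AssembleUploadChunksJob code): after concatenating the chunks, compare the result against the size the client declared and fail loudly instead of publishing a truncated file.

<?php
// Hypothetical post-assembly sanity check against the declared total size.

function checkAssembledSize( string $assembledPath, int $declaredSize ): void {
    clearstatcache( true, $assembledPath );
    $actual = filesize( $assembledPath );
    if ( $actual === false || $actual !== $declaredSize ) {
        throw new RuntimeException(
            "Assembled upload is " . var_export( $actual, true ) .
            " bytes but the client declared $declaredSize bytes"
        );
    }
}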

Bawolff renamed this task from "Incomplete files uploaded (10 MB interruption)" to "Incomplete files uploaded - chunked upload drops last chunk". Feb 22 2024, 11:59 PM
Bawolff updated the task description. (Show Details)

Someone should do something about the buggy uploads:

  1. Find a way to check files at these multiple-of-5-MB boundaries for corrupted ones, and tag the files (see the sketch below).
  2. Send out reminders asking the uploaders to re-upload.
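A rough sketch of the size test behind step 1, assuming a hypothetical helper (isSuspectSize; the list actually used later in this task came from a Quarry query): files whose byte size is an exact multiple of the 5 MiB chunk size are candidates for truncation.

<?php
// Hypothetical filter: flag sizes that are exact multiples of the chunk size.

const CHUNK_SIZE = 5 * 1024 * 1024; // UploadWizard's usual chunk size

function isSuspectSize( int $bytes ): bool {
    return $bytes > 0 && $bytes % CHUNK_SIZE === 0;
}

var_dump( isSuspectSize( 10485760 ) ); // bool(true)  - the size seen in the Swift logs above
var_dump( isSuspectSize( 16073728 ) ); // bool(false) - the full size of the Delft example above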

As an aside, there is some historical discussion at https://static-codereview.wikimedia.org/MediaWiki/104687.html (for context, back in the day we used MySQL for job tracking, which probably made the code less risky at the time).

Change 1005865 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/core@master] Prevent race condition between AssembleUploadChunksJob and transaction

https://gerrit.wikimedia.org/r/1005865

Change 1005965 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/core@master] Make chunked upload delete chunks in deferred job

https://gerrit.wikimedia.org/r/1005965

The second change (1005965, "Make chunked upload delete chunks in deferred job") was meant to be for a different bug, not this one.

Change 1005865 merged by jenkins-bot:

[mediawiki/core@master] Prevent race condition between AssembleUploadChunksJob and transaction

https://gerrit.wikimedia.org/r/1005865

Well, so far so good, but it may be a bit early to tell, so I will keep this bug open until a week has passed to make sure the issue is really fixed. The fix should be in place as of Feb 28, 19:30 UTC. The only upload since then that is a multiple of 5 MB is [[File:OR-126,_Knowles_Tunnel_-_53555943932.jpg]], which appears to be a false positive, as the source image is exactly 20 MB. Prior to this we were seeing about 1 corrupted upload a day.

Oddly enough, Flickr images often seem to be a multiple of 5 MB. They aren't corrupted; Flickr just uses that as a default size or something.

It'd be cool if someone with a bot would notify people who have uploaded potentially affected files. There is a list of such files at https://quarry.wmcloud.org/query/80784# (This list only includes files uploaded after June 2023. It is possible there are some false positives on the list).

Bawolff closed this task as Resolved (edited). Mar 5 2024, 5:15 AM
Bawolff claimed this task.

It's been about 5 days with no new examples. Previously there were about 2 a day. I'm going to call this fixed.

Also, I don't see any mention in the logs of the exception I added for double-checking, so I think that means lazyPush() fixed the underlying cause and we aren't just aborting when encountering the issue.
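For reference, a simplified sketch of the distinction the fix relies on. JobQueueGroup::push() and JobQueueGroup::lazyPush() are real MediaWiki methods, but the wiring below is deliberately minimal and $job stands in for the AssembleUploadChunksJob constructed by the upload code.

<?php
// Assumes a MediaWiki request context and a $job built elsewhere
// (e.g. an AssembleUploadChunksJob).

use MediaWiki\MediaWikiServices;

$jobQueueGroup = MediaWikiServices::getInstance()->getJobQueueGroup();

// push(): the job is enqueued immediately, so a job runner can pick it up
// before the web request's DB transaction (the one that wrote the final
// upload_stash row) has committed - the race described in this task.
// $jobQueueGroup->push( $job );

// lazyPush(): the enqueue is deferred until the current transaction round
// has committed, so by the time the job runs it can see every chunk row.
$jobQueueGroup->lazyPush( $job );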