
Some corrupt thumbs remain from initial Swift deploy
Closed, Resolved, Public

Description

Ralf Schmitt 2012-02-24 09:40:59 UTC reported in bug 34611#c3 :

btw, upload.wikimedia.org is currently serving corrupt thumb images
(see below). What makes you think that you solved the problem?

,----

wget -S http://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Commons-emblem-disambig-notice.svg/1200px-Commons-emblem-disambig-notice.svg.png
--2012-02-24 10:30:55--  http://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Commons-emblem-disambig-notice.svg/1200px-Commons-emblem-disambig-notice.svg.png
Resolving upload.wikimedia.org... 208.80.152.211
Connecting to upload.wikimedia.org|208.80.152.211|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 200 OK
Last-Modified: Thu, 02 Feb 2012 17:10:31 GMT
Accept-Ranges: bytes
Content-Type: image/png
Content-Length: 102400
Date: Mon, 20 Feb 2012 03:49:30 GMT
Age: 366058
X-Cache: HIT from sq83.wikimedia.org
X-Cache-Lookup: HIT from sq83.wikimedia.org:3128
X-Cache: MISS from sq84.wikimedia.org
X-Cache-Lookup: MISS from sq84.wikimedia.org:80
Connection: keep-alive
Length: 102400 (100K) [image/png]
Saving to: `1200px-Commons-emblem-disambig-notice.svg.png'
100%[======================================>] 102,400 112K/s in 0.9s
2012-02-24 10:30:56 (112 KB/s) - `1200px-Commons-emblem-disambig-notice.svg.png' saved [102400/102400]

[py27] ~/t/ % md5sum 1200px-Commons-emblem-disambig-notice.svg.png
4a42cbe023060d011d6dc1f92572eb1c  1200px-Commons-emblem-disambig-notice.svg.png

[py27] ~/t/ % display 1200px-Commons-emblem-disambig-notice.svg.png
display: Expected 8192 bytes; found 3893 bytes `1200px-Commons-emblem-disambig-notice.svg.png' @ warning/png.c/MagickPNGWarningHandler/1754.
display: Read Exception `1200px-Commons-emblem-disambig-notice.svg.png' @ error/png.c/MagickPNGErrorHandler/1728.
display: corrupt image `1200px-Commons-emblem-disambig-notice.svg.png' @ error/png.c/ReadPNGImage/3695.


Version: unspecified
Severity: normal

Details

Reference
bz34695

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 12:13 AM
bzimport set Reference to bz34695.

ralf_wikimedia wrote:

*** Bug 34611 has been marked as a duplicate of this bug. ***

ralf_wikimedia wrote:

Adding my comment from Bug 34611:

But I guess not all of the corrupt images have been removed. Judging from the
ones I looked at today these were all .svg images and the truncated files have
a filesize that is a multiple of 4096.
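
The two observations above (truncation, and sizes that are a multiple of 4096) can be turned into a quick local check. The following is only a sketch and not tooling used in this bug: `size_is_block_multiple` and `looks_truncated_png` are hypothetical helper names, and the check assumes a complete PNG ends with the standard IEND chunk with nothing after it.

```python
# Standard 8-byte PNG signature and the fixed 12-byte IEND chunk
# (zero length, type "IEND", CRC 0xAE426082) that closes a complete PNG.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
PNG_IEND = b"\x00\x00\x00\x00IEND\xaeB`\x82"


def size_is_block_multiple(data: bytes, block: int = 4096) -> bool:
    """Cheap heuristic: non-empty and an exact multiple of the block size."""
    return len(data) > 0 and len(data) % block == 0


def looks_truncated_png(data: bytes) -> bool:
    """True if the data starts like a PNG but does not end with IEND."""
    return data.startswith(PNG_SIGNATURE) and not data.endswith(PNG_IEND)


if __name__ == "__main__":
    # Synthetic stand-in for a thumbnail; the payload is not real image data.
    complete = PNG_SIGNATURE + b"\x00" * 5000 + PNG_IEND
    truncated = complete[:4096]
    print(looks_truncated_png(complete), looks_truncated_png(truncated))
```

The size test alone will flag some intact files, since a healthy thumbnail can happen to be a multiple of 4096 bytes long, so it is best used only to pick candidates for the IEND check. The 102400-byte file in the description is 25 × 4096 bytes.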

bhartshorne wrote:

The initial run to purge broken thumbnails reduced the incidence from about 1.5% of all thumbnails to 0.003%, but I believe a few are still left. I am currently running a slow process to cull the rest; it will likely take at least 2 weeks to complete. Although that is a long time, I think it's OK given the low incidence.

Re: Ralf's comment "what makes you think you solved the problem?": we were able to recreate the issue by initiating a connection to Swift requesting a thumbnail that doesn't already exist and closing the connection before the entire thumbnail is returned. Closing the client side early resulted in a truncated file getting written to Swift. We adjusted the code in Swift to pay attention to the Content-Length and ETag headers (if they exist) and at the same time adjusted the code on ms5 (Swift's current backend) and the image scalers to create Content-Length and ETag headers whenever possible. After making these changes, closing the client connection prematurely resulted in nothing getting written to Swift instead of a truncated image. The PUT to Swift would fail because the closed connection meant that the data pushed into the system did not match whichever headers were available.
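
The idea of rejecting a body that does not match its headers can be sketched as follows. This is an illustration under stated assumptions, not the actual change made to Swift or rewrite.py: `receive_verified` and `UploadMismatch` are hypothetical names, and the ETag is assumed to be the hex MD5 of the body, which is how Swift treats it for ordinary objects.

```python
import hashlib
from typing import Iterable, Optional


class UploadMismatch(Exception):
    """Raised when the received body does not match the declared headers."""


def receive_verified(chunks: Iterable[bytes],
                     content_length: Optional[int] = None,
                     etag: Optional[str] = None) -> bytes:
    """Collect a streamed body; return it only if it matches the headers.

    If the sender hangs up early, fewer bytes arrive than Content-Length
    announced (and the MD5 differs from the ETag), so nothing is returned
    for storage.
    """
    md5 = hashlib.md5()
    received = bytearray()
    for chunk in chunks:
        md5.update(chunk)
        received.extend(chunk)

    if content_length is not None and len(received) != content_length:
        raise UploadMismatch(
            f"expected {content_length} bytes, got {len(received)}")
    if etag is not None and md5.hexdigest() != etag.strip('"').lower():
        raise UploadMismatch("MD5 of body does not match ETag")
    return bytes(received)


if __name__ == "__main__":
    body = b"x" * 10000
    try:
        # Simulate a connection closed after the first 4096 bytes.
        receive_verified([body[:4096]], content_length=len(body))
    except UploadMismatch as exc:
        print("rejected:", exc)
```

When neither header is available the sketch has nothing to check against and accepts the body, which matches the "whichever headers were available" caveat above.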

While we can never be absolutely sure that a different bug with the same symptoms doesn't also exist, all my tests so far have been unable to recreate truncated images in Swift. Additionally, I installed a process to monitor roughly 30% of all newly created Swift objects and check them against the copy on ms5 to identify any new incidence of the same (or similar) bugs. This monitoring process hasn't seen any truncated images appear since we deployed the fix to the dropped connection bug.
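
A monitor of this kind can be sketched in a few lines. The code below is not the monitoring process described above; `find_mismatches`, `fetch_swift`, and `fetch_origin` are hypothetical names, and the two fetch callables stand in for reading an object from Swift and from ms5.

```python
import random
from typing import Callable, Iterable, List, Optional


def find_mismatches(names: Iterable[str],
                    fetch_swift: Callable[[str], bytes],
                    fetch_origin: Callable[[str], bytes],
                    sample_rate: float = 0.3,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Compare a random sample of objects between two stores.

    Each name is checked with probability sample_rate; names whose
    contents differ between the two stores are returned.
    """
    rng = rng or random.Random()
    mismatched = []
    for name in names:
        if rng.random() >= sample_rate:
            continue  # not part of this sample
        if fetch_swift(name) != fetch_origin(name):
            mismatched.append(name)
    return mismatched


if __name__ == "__main__":
    # In-memory stand-ins for the two stores; "b.png" is truncated in one.
    origin = {"a.png": b"A" * 5000, "b.png": b"B" * 5000}
    swift = {"a.png": b"A" * 5000, "b.png": b"B" * 4096}
    print(find_mismatches(origin, swift.__getitem__, origin.__getitem__,
                          sample_rate=1.0))
```

Comparing full contents keeps the sketch simple; a real monitor would more likely compare sizes or stored checksums to avoid reading every sampled object twice.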

The file referenced in this bug (the Commons emblem) was created truncated in Swift before the fix for the dropped connection bug was deployed, so it is a leftover remnant rather than a new example.

I'll close this bug when the final cleanup of the remaining broken thumbnails is complete.

bhartshorne wrote:

(Oh, I forgot: in the meantime, if you find specific images that are truncated, please feel free to run ?action=purge on them. That will clear up the problem for a specific image that's affecting you while I continue the more complete scan of all thumbnails.)

ralf_wikimedia wrote:

The best I can (sanely) do here is purge all images that have a filesize which is a multiple of 4096. But I think you should be able to do that with much less overhead.

ralf_wikimedia wrote:

*** Bug 34611 has been marked as a duplicate of this bug. ***

ralf_wikimedia wrote:

Doesn't the "?action=purge" open up a good opportunity for a DoS attack?
We already know that the current system can't handle the load generated by the pdf cluster if all of the thumbnails have to be regenerated.

Taking the 1.19 milestone off this bug, since we have it mostly solved and it will take longer than this Wednesday to fix.

Update on this issue: Ben wrote a 'delete-old-objects' script just before going out of office for a while, which will delete all thumbnails generated before February 5. Leslie has taken over running it. It is a long-running process, but it is 70% (?) done now. Basically, there are 5 Swift backend boxes; the process has already run on #1-3 and is now running on #4, so #5 is the only one left untouched.

After this process is done, there may be a *few* images left (since I think there's a grey zone between February 5 and when we're much more certain that things are fixed), so there may be another much shorter pass that's needed. Ben should be back in the office to finish this off, do some verification, and then mark this bug fixed, asking for independent confirmation.
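
The selection rule behind such a cleanup (everything generated before February 5) can be sketched as below. This is not the 'delete-old-objects' script itself; `objects_to_delete` is a hypothetical name, and the listing entries are assumed to be dicts with `name` and an ISO 8601 `last_modified` field, similar to what a Swift container listing returns in JSON form.

```python
from datetime import datetime
from typing import Dict, Iterable, List

# Thumbnails written before this date are considered suspect (year taken
# from the timestamps elsewhere in this bug).
CUTOFF = datetime(2012, 2, 5)


def objects_to_delete(listing: Iterable[Dict[str, str]],
                      cutoff: datetime = CUTOFF) -> List[str]:
    """Return the names of listed objects last modified before the cutoff."""
    return [
        entry["name"]
        for entry in listing
        if datetime.fromisoformat(entry["last_modified"]) < cutoff
    ]


if __name__ == "__main__":
    # Hypothetical listing entries for illustration only.
    listing = [
        {"name": "old-thumb.png", "last_modified": "2012-02-02T17:10:31.000000"},
        {"name": "new-thumb.png", "last_modified": "2012-02-20T03:49:30.000000"},
    ]
    print(objects_to_delete(listing))
```

A date cutoff deletes healthy thumbnails along with broken ones, which is acceptable here only because thumbnails are regenerated on demand.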

bhartshorne wrote:

Update:

The script to delete truncated images has run to completion a few times and eliminated most of the truncated images. Some were left, and with more digging I found that they were objects in Swift that do not exist in the container listings (as though you can read a file in a directory, but when you list the contents of the directory you don't see the file).

I've started a process that is crawling every object in swift and testing to verify that it is present in the container listing, deleting those that aren't. So far it has found several (in the tens of objects per commons shard) objects that aren't listed in the containers - about 0.04%. Based on the progress of the script so far I expect it should take about 12 days to complete the sweep. Results so far show that the most recent file that exists but is not listed in a container is 2012-03-20, so it seems that whatever triggered the bug that allowed them to exist is no longer happening.
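
The core of that sweep is a set difference between the objects that can be read and the objects the container listing reports. The sketch below only illustrates that comparison, with hypothetical names (`find_unlisted`, `unlisted_fraction`) and plain lists standing in for the crawl results and the listings; it is not the script described above.

```python
from typing import Iterable, Set


def find_unlisted(readable: Iterable[str], listed: Iterable[str]) -> Set[str]:
    """Names that can be read directly but are absent from the listing."""
    return set(readable) - set(listed)


def unlisted_fraction(readable: Iterable[str], listed: Iterable[str]) -> float:
    """Fraction of readable objects that the listing does not report."""
    readable_set = set(readable)
    if not readable_set:
        return 0.0
    return len(readable_set - set(listed)) / len(readable_set)


if __name__ == "__main__":
    # Hypothetical object names for illustration only.
    readable = ["a.png", "b.png", "c.png", "d.png"]
    listed = ["a.png", "c.png", "d.png"]
    print(find_unlisted(readable, listed))
    print(unlisted_fraction(readable, listed))
```

Each name returned by `find_unlisted` is a candidate for deletion, whether or not the object itself is truncated.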

Note that most of the objects missing from the container listings are not actually truncated images, but it is best to purge them anyway, since they will still cause trouble if the original image is updated. In other words, there are two problems: truncated images and objects missing from the container listing. When both problems affect a single file, the symptom is a truncated file that can't be purged.

After many runs of the cleaner script, and given that we have long since disabled the PUT code in rewrite.py that caused the problems and I haven't had any reports of this occurring, I'm closing this bug.

Gilles raised the priority of this task from High to Unbreak Now!. Dec 4 2014, 10:21 AM
Gilles moved this task from Untriaged to Done on the Multimedia board.
Gilles lowered the priority of this task from Unbreak Now! to High. Dec 4 2014, 11:22 AM