Cached thumbnails and originals are sometimes not being purged correctly/quickly
Closed, Resolved (Public)

Assigned To

Authored By

	AntiCompositeNumber
	Jun 24 2020, 8:03 PM

Description

File:Peoples jumping in La Guardia beach.jpg was first uploaded at 18:56, 6 May 2014. A new version was uploaded 19:19, 24 June 2020. However, after 30 minutes, the old version of the original is still being served from the cache: https://upload.wikimedia.org/wikipedia/commons/c/c1/Peoples_jumping_in_La_Guardia_beach.jpg.

HTTP/2 200 OK
date: Wed, 24 Jun 2020 03:03:46 GMT
content-type: image/jpeg
content-length: 3237571
x-object-meta-sha1base36: 7qmhhjfkd8dmn6ibict0ckuq8y1j98n
accept-ranges: bytes
last-modified: Tue, 06 May 2014 18:56:33 GMT
etag: 563acb8ef5ddc32c537a2e16ae2aaacf
x-timestamp: 1399402592.92937
server: ATS/8.0.7
age: 60353
x-cache: cp1080 hit, cp1082 pass
x-cache-status: hit-local
server-timing: cache;desc="hit-local"
strict-transport-security: max-age=106384710; includeSubDomains; preload
x-client-ip: ****
access-control-allow-origin: *
access-control-expose-headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache
timing-allow-origin: *
X-Firefox-Spdy: h2

Requesting an original-size thumbnail results in the new file. Purging the file page has no effect.
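The `date`, `age`, and `last-modified` headers above can be combined to estimate when the cache answered and whether the object predates the new upload. The following is a minimal sketch, not part of the original report. It assumes that `date` is when the cached response was generated, that `age` is the number of seconds it has spent in cache since then, and that the upload time in the description is UTC. The helper names `served_at` and `predates_upload` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def served_at(date_header: str, age_header: str) -> datetime:
    """Estimate when the cache answered: Date plus Age seconds."""
    return parsedate_to_datetime(date_header) + timedelta(seconds=int(age_header))


def predates_upload(last_modified_header: str, upload_time: datetime) -> bool:
    """True if Last-Modified is older than the new upload."""
    return parsedate_to_datetime(last_modified_header) < upload_time


# Header values copied from the response above.
answered = served_at("Wed, 24 Jun 2020 03:03:46 GMT", "60353")
# Upload time from the description, assumed to be UTC.
new_upload = datetime(2020, 6, 24, 19, 19, tzinfo=timezone.utc)
stale = predates_upload("Tue, 06 May 2014 18:56:33 GMT", new_upload)

print(answered.isoformat(), stale)
```

Under these assumptions the estimate is 19:49:39 UTC on 24 June 2020, about 30 minutes after the 19:19 upload, and the `last-modified` value still matches the 2014 upload.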

This issue is not limited to originals; it also affects thumbnails. https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Public_art_-_GH_Piesse%2C_Katanning.jpg/220px-Public_art_-_GH_Piesse%2C_Katanning.jpg was previously affected by the Thumbor EXIF rotation bug, which was fixed a while ago. However, the thumbnail remained rotated and could not be purged. The problem resolved itself when I began typing up a Phab task, two days after it was first mentioned on 2020-06-22. At the time, the headers indicated that the thumbnail was being served by Thumbor, not through ATS. The file is now being served through ATS. Unfortunately, I no longer have the headers for any of that.

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T43371 Thumbnail/imagescaler (tracking)
		Resolved		• ema	T256313 Cached thumbnails and originals are sometimes not being purged correctly/quickly

Event Timeline

AntiCompositeNumber created this task. Jun 24 2020, 8:03 PM

Restricted Application added projects: SRE, Commons. Jun 24 2020, 8:03 PM

Restricted Application added a subscriber: Aklapper.

AntiCompositeNumber updated the task description. Jun 24 2020, 8:04 PM

It takes a few days before the image is actually updated, which makes it impossible to correct image defects during, for example, a Featured Picture nomination on Commons.

King_of_Hearts subscribed. Jun 24 2020, 9:42 PM

I can report having seen this problem in several instances as well, including https://upload.wikimedia.org/wikipedia/commons/8/87/St._Venantius%2C_Wertheim%2C_Nave_20160802_1.jpg.

Yet another one, still broken at the time of writing: https://upload.wikimedia.org/wikipedia/commons/5/53/Lower_Manhattan_from_Governors_Island_August_2017_panorama.jpg.

Reedy added a project: SRE-swift-storage. Jun 25 2020, 1:41 AM

Aklapper added a parent task: T43371: Thumbnail/imagescaler (tracking). Jun 25 2020, 7:03 AM

Frank_Schulenburg subscribed. Jun 25 2020, 3:19 PM

• ema triaged this task as Medium priority. Jul 2 2020, 8:56 AM

• ema moved this task from Backlog to Caching on the Traffic board.

The cause is most probably T256444. I'm saying this based on two important pieces of information submitted in this bug report by @AntiCompositeNumber (thanks!):

Date of the request: date: Wed, 24 Jun 2020 03:03:46 GMT
Cache nodes that served the object: x-cache: cp1080 hit, cp1082 pass

The X-Cache header must be read from right to left to understand the order in which the request went through our CDN. First, the cache frontend layer on cp1082 let the request pass through without caching, as we do at the frontend layer for objects larger than 256K (this one has content-length: 3237571, about 3 MB). Then the object was served from cache by the backend layer on cp1080. At 3 AM on June 24, cp1080 was not processing purge requests correctly due to T256444:

Screenshot from 2020-07-02 11-03-04.png (1×2 px, 194 KB)
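The right-to-left reading of X-Cache described above can be sketched in a few lines. This is an illustrative helper only, not Wikimedia tooling: the function names are hypothetical, the entry format is assumed to be `host status` separated by commas as in the header quoted in this task, and "256K" is assumed to mean 256 KiB.

```python
# Assumption: the "256K" frontend limit mentioned above means 256 KiB.
FRONTEND_PASS_THRESHOLD = 256 * 1024


def parse_x_cache(value: str) -> list[tuple[str, str]]:
    """Return (host, status) pairs in the order the request traversed them.

    The header lists the last hop first, so the entries are reversed.
    """
    hops = []
    for entry in value.split(","):
        host, _, status = entry.strip().partition(" ")
        hops.append((host, status))
    hops.reverse()
    return hops


def too_large_for_frontend(content_length: int) -> bool:
    """True if the object exceeds the assumed frontend caching limit."""
    return content_length > FRONTEND_PASS_THRESHOLD


# Values copied from the response headers in the task description.
path = parse_x_cache("cp1080 hit, cp1082 pass")
for host, status in path:
    print(host, status)
print(too_large_for_frontend(3237571))
```

For the header in this task, the first hop is cp1082 with status `pass` and the second is cp1080 with status `hit`, which matches the explanation above.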

Unfortunately, I cannot confirm whether the two objects mentioned by @King_of_Hearts were also stale due to the librdkafka issue, but it seems plausible. We have been using a patched version of librdkafka since yesterday, 2020-07-01T11:50 UTC, and things look better for now. Please let me know if you see outdated images again, and don't forget to include the full request headers if possible.
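One possible way to capture headers for a future report is a small script using the Python standard library. This is only a sketch under stated assumptions: the function names and the User-Agent string are hypothetical placeholders, and it assumes a HEAD request returns the same cache-related headers (such as x-cache and age) as the GET a browser would issue.

```python
import urllib.request

# Hypothetical User-Agent; replace it with one that identifies you.
USER_AGENT = "stale-image-report/0.1 (example script)"


def build_head_request(url: str) -> urllib.request.Request:
    """Build a HEAD request so only headers are transferred, not the image."""
    return urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": USER_AGENT}
    )


def fetch_headers(url: str) -> list[str]:
    """Return the response headers as 'name: value' lines for pasting into a task."""
    with urllib.request.urlopen(build_head_request(url), timeout=10) as response:
        return [f"{name}: {value}" for name, value in response.getheaders()]
```

Calling `print("\n".join(fetch_headers(url)))` with the URL of the affected file would print the headers in a form that can be pasted into a report.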

There have been no new reports of stale images since the librdkafka upgrade 20 days ago; closing.

	F31913876: Screenshot from 2020-07-02 11-03-04.png
	Jul 2 2020, 9:12 AM
