Cached thumbnails and originals are sometimes not being purged correctly/quickly
Closed, Resolved · Public


File:Peoples jumping in La Guardia beach.jpg was first uploaded at 18:56, 6 May 2014. A new version was uploaded at 19:19, 24 June 2020. However, 30 minutes later, the old version of the original is still being served from the cache:

HTTP/2 200 OK
date: Wed, 24 Jun 2020 03:03:46 GMT
content-type: image/jpeg
content-length: 3237571
x-object-meta-sha1base36: 7qmhhjfkd8dmn6ibict0ckuq8y1j98n
accept-ranges: bytes
last-modified: Tue, 06 May 2014 18:56:33 GMT
etag: 563acb8ef5ddc32c537a2e16ae2aaacf
x-timestamp: 1399402592.92937
server: ATS/8.0.7
age: 60353
x-cache: cp1080 hit, cp1082 pass
x-cache-status: hit-local
server-timing: cache;desc="hit-local"
strict-transport-security: max-age=106384710; includeSubDomains; preload
x-client-ip: ****
access-control-allow-origin: *
access-control-expose-headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache
timing-allow-origin: *
X-Firefox-Spdy: h2

Requesting an original-size thumbnail results in the new file. Purging the file page has no effect.
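
The timing in the headers above can be checked with a short sketch. It assumes that the cache forwards the stored `date` header unchanged and that `age` counts the seconds since then, as standard HTTP caching semantics describe; whether this CDN behaves exactly that way is an assumption. It also assumes the upload time quoted above is in UTC. The function name `response_timeline` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def response_timeline(headers):
    """Derive (stored_at, served_at, last_modified) from response headers.

    Assumption: `date` is when the cached response was generated and
    `age` is the number of seconds it has been held in cache since then.
    """
    stored_at = parsedate_to_datetime(headers["date"])
    served_at = stored_at + timedelta(seconds=int(headers["age"]))
    last_modified = parsedate_to_datetime(headers["last-modified"])
    return stored_at, served_at, last_modified


# Values copied from the response above.
headers = {
    "date": "Wed, 24 Jun 2020 03:03:46 GMT",
    "age": "60353",
    "last-modified": "Tue, 06 May 2014 18:56:33 GMT",
}
stored_at, served_at, last_modified = response_timeline(headers)

# Upload time of the new version as reported above (assumed to be UTC).
new_upload = datetime(2020, 6, 24, 19, 19, tzinfo=timezone.utc)

# Stale: the served object predates the new upload, yet it was served after it.
is_stale = last_modified < new_upload <= served_at

print(served_at.isoformat())  # 2020-06-24T19:49:39+00:00
print(is_stale)               # True
```

Under these assumptions the response was served roughly 30 minutes after the new upload, which agrees with the description, while `last-modified` still points at the 2014 version.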

This issue is not limited to originals; it also affects thumbnails. Another file was previously affected by the Thumbor EXIF rotation bug, which was fixed a while ago. However, the thumbnail remained rotated and could not be purged. That problem resolved itself when I began typing up a Phab task, two days after it was first mentioned on 2020-06-22. The headers indicated that it was being served by Thumbor, not through ATS. The file is now being served through ATS. Unfortunately, I no longer have the headers for any of that.

Event Timeline

Restricted Application added a subscriber: Aklapper.

It takes a few days before the image is actually updated, which makes it impossible to correct image defects during, for example, a Featured Picture nomination on Commons.

ema triaged this task as Medium priority. Jul 2 2020, 8:56 AM
ema moved this task from Triage to Caching on the Traffic board.

The cause is most probably T256444. I'm saying this based on two important pieces of information submitted in this bug report by @AntiCompositeNumber (thanks!):

  • date of the request, date: Wed, 24 Jun 2020 03:03:46 GMT
  • cache nodes that served the object, x-cache: cp1080 hit, cp1082 pass

The X-Cache header must be read from right to left to understand the order in which the request went through our CDN. First, the cache frontend layer on cp1082 let the request pass through without caching, as we do at the frontend layer for objects larger than 256K (this one has content-length: 3237571, about 3M). Then the object was served from cache by the backend layer on cp1080. At 3 AM on June 24, cp1080 was not processing purge requests correctly due to T256444:

Screenshot from 2020-07-02 11-03-04.png (1×2 px, 194 KB)
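
The right-to-left reading of X-Cache described above can be sketched as follows. This is a minimal illustration, not the CDN's own tooling: the function names are hypothetical, and treating "256K" as 256 KiB (262144 bytes) is an assumption.

```python
# Assumption: "256K" means 256 KiB; the real frontend threshold may be
# defined differently.
FRONTEND_SIZE_LIMIT = 256 * 1024


def parse_x_cache(value):
    """Return (host, status) pairs in the order the request traversed them.

    The header lists the last hop first, so the entries are reversed.
    """
    hops = []
    for entry in value.split(","):
        host, status = entry.split(maxsplit=1)
        hops.append((host, status.strip()))
    hops.reverse()
    return hops


def bypasses_frontend_cache(content_length):
    """True if the object exceeds the assumed frontend size limit."""
    return content_length > FRONTEND_SIZE_LIMIT


for host, status in parse_x_cache("cp1080 hit, cp1082 pass"):
    print(host, status)
# cp1082 pass
# cp1080 hit

print(bypasses_frontend_cache(3237571))  # True
```

Read this way, the frontend on cp1082 comes first and the backend hit on cp1080 second, matching the explanation above.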

Unfortunately, I cannot confirm whether the two objects mentioned by @King_of_Hearts were also stale due to the librdkafka issue, but it seems plausible. We have been running a patched version of librdkafka since yesterday (2020-07-01T11:50 UTC), and things look better for now. Please let me know if you see outdated images again, and don't forget to include the full request headers if possible.

ema claimed this task.

There have been no new reports of stale images since the librdkafka upgrade 20 days ago; closing.