
404 error opening a specific file on Commons (due to inconsistent state between two swift clusters)
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:
The file cannot be opened. The following error message is shown instead.

File not found: /v1/AUTH_mw/wikipedia-commons-local-public.d5/d/d5/Typhoon-Yagi_5.jpg

Trying to open a thumbnail also returns a 404 error (the following is from the 302 × 214 px thumbnail):

Request from 223.122.158.63 via cp5027 cp5027, Varnish XID 941840582
Upstream caches: cp5027 int
Error: 404, Not Found at Sun, 15 Sep 2024 07:11:10 GMT

What should have happened instead?:
The original image should appear!

Other information (browser name/version, screenshots, etc.):

  • It appears that Mozilla Firefox is unaffected.
  • First reported on Commons here.
  • Similar incidents occurred in 2022. One was eventually solved by purging the page, but this time purging does not help.
  • Only this specific file is affected.
  • Might relate to T314712.

Event Timeline

I have no issues visiting the file page or opening the original file in Chrome and Safari (both on macOS and iOS). I didn't try Edge.

Aklapper renamed this task from Failure to Open a Specific File on Commons to 404 error opening a specific file on Commons. Sep 15 2024, 8:22 AM
Aklapper added a project: SRE-swift-storage.

Web browser is irrelevant in this case. This might be a datacenter sync issue, so different regions of the world see different behavior.

It seems that everything is fine when the file is accessed from Germany.

Also, just a muggle's question: as far as I remember, the WMF's servers are in Florida, so why would data centers in different places give different results?

I'm located on the US west coast.

Yesterday, it worked for me with Firefox but failed with Chrome and Edge. The failing server returned "File not found: /v1/AUTH_mw/wikipedia-commons-local-public.d5/d/d5/Typhoon-Yagi_5.jpg" with these response headers:

access-control-allow-origin: *
access-control-expose-headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache
age: 480
content-length: 85
content-type: text/html; charset=UTF-8
date: Sat, 14 Sep 2024 17:58:37 GMT
nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
server: envoy
server-timing: cache;desc="hit-front", host;desc="cp4052"
strict-transport-security: max-age=106384710; includeSubDomains; preload
timing-allow-origin: *
x-cache: cp4052 miss, cp4052 hit/4
x-cache-status: hit-front
x-content-type-options: nosniff

Today I tried with Edge, and the request displayed the picture. The response headers were:

accept-ranges: bytes
access-control-allow-origin: *
access-control-expose-headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache
age: 0
content-length: 7434740
content-type: image/jpeg
date: Sun, 15 Sep 2024 14:55:18 GMT
etag: fe68fa2d2c9fb9101db078cb263815cb
last-modified: Fri, 13 Sep 2024 09:57:44 GMT
nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
server: envoy
server-timing: cache;desc="miss", host;desc="cp1115"
strict-transport-security: max-age=106384710; includeSubDomains; preload
timing-allow-origin: *
x-cache: cp1115 miss, cp1115 miss
x-cache-status: miss
x-content-type-options: nosniff
x-object-meta-sha1base36: l1h10jxvtd5o73z4q51fcqsot4fy2wu
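
For anyone comparing the two header dumps above: the useful differences are in `x-cache`, `x-cache-status`, and `content-type`. Below is a minimal sketch for splitting an `x-cache` value into per-host entries, so it is easier to see which cache host answered and whether it was a hit. The helper name `parse_x_cache` is hypothetical and not part of any Wikimedia tooling, and the format is inferred only from the values quoted in this task.

```python
def parse_x_cache(value):
    """Split an X-Cache header value into (host, status, hit_count) tuples.

    Assumes the "host status[/count]" format seen in this task,
    e.g. "cp4052 miss, cp4052 hit/4". hit_count is None when absent.
    """
    entries = []
    for part in value.split(","):
        fields = part.split()
        if len(fields) < 2:
            continue  # skip anything that does not look like "host status"
        host, status = fields[0], fields[1]
        hits = None
        if "/" in status:
            status, count = status.split("/", 1)
            hits = int(count)
        entries.append((host, status, hits))
    return entries


# Values copied from the two responses quoted above.
failing = parse_x_cache("cp4052 miss, cp4052 hit/4")
working = parse_x_cache("cp1115 miss, cp1115 miss")
print(failing)
print(working)
```

In the failing response, the last entry is a cache hit on cp4052, which suggests the 404 page itself was being served from the front-end cache rather than fetched again from storage.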

Update: As of 16:00 UTC+8, I can now access the file without problems in Hong Kong. I would like to hear whether anyone elsewhere still has trouble accessing the file.

MatthewVernon claimed this task.
MatthewVernon subscribed.

I've confirmed that both eqiad and codfw swift clusters have this object. They arrived at different times, however:

eqiad:

Last Modified: Fri, 13 Sep 2024 09:57:44 GMT

codfw:

Last Modified: Mon, 16 Sep 2024 05:16:02 GMT

This will be because mw failed to write it to both DCs when it was uploaded on Friday (this happens sometimes, for reasons that are hard to pin down), which left it in this odd state where sometimes you'd see it and sometimes not (depending on how your request got routed). We have a weekly job that catches where mw has left things in an inconsistent state between the two swift clusters and fixes it, and that will have copied the object to codfw this morning. So that's why it now works :)
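
To illustrate the kind of check described above, here is a rough sketch of comparing object listings from two clusters and reporting what differs. This is not the actual weekly job: the function name `find_inconsistencies`, the `{object_name: etag}` listing format, and the second sample object are all assumptions made for illustration. The first sample entry models the state on Friday, when only eqiad had the file.

```python
def find_inconsistencies(eqiad, codfw):
    """Compare two {object_name: etag} listings and report differences.

    Hypothetical helper: returns objects present in only one cluster,
    plus objects present in both whose etags disagree.
    """
    eqiad_names, codfw_names = set(eqiad), set(codfw)
    return {
        "missing_in_codfw": sorted(eqiad_names - codfw_names),
        "missing_in_eqiad": sorted(codfw_names - eqiad_names),
        "etag_mismatch": sorted(
            name for name in eqiad_names & codfw_names
            if eqiad[name] != codfw[name]
        ),
    }


# Illustrative listings. The etag for Typhoon-Yagi_5.jpg is the one quoted
# in the response headers earlier in this task; Example.jpg is made up.
eqiad_listing = {
    "d/d5/Typhoon-Yagi_5.jpg": "fe68fa2d2c9fb9101db078cb263815cb",
    "a/ab/Example.jpg": "0" * 32,
}
codfw_listing = {
    "a/ab/Example.jpg": "0" * 32,
}

report = find_inconsistencies(eqiad_listing, codfw_listing)
print(report)
```

Any name reported under `missing_in_codfw` would be a candidate for copying from eqiad to codfw, which matches what happened to this file on Monday morning.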

Aklapper renamed this task from 404 error opening a specific file on Commons to 404 error opening a specific file on Commons (due to inconsistent state between two swift clusters). Sep 16 2024, 9:44 AM