
Thumbnails on beta cluster return 503 Service Unavailable
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:
503 Service Unavailable
No server is available to handle this request.

What should have happened instead?:
200

Has been like this for days.

Event Timeline

If someone could purge https://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png and then visit (regen) https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:Redirect/file/Bert_Self-portrait2.png&width=200, that should fix it


nb. we had this a little while back on a random production commons image fwiw, the above resolved it that time
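For reference, the regeneration URL in the suggestion above can be derived mechanically from the thumbnail URL. The following is only a sketch: the helper name `regen_url` is hypothetical, the wiki host has to be supplied by the caller because it cannot be inferred from the upload host, and percent-encoded file names are not handled.

```python
import re
from urllib.parse import urlencode, urlsplit

# Matches .../thumb/<h>/<hh>/<file name>/<width>px-<file name>
THUMB_PATH_RE = re.compile(r"/thumb/[0-9a-f]/[0-9a-f]{2}/([^/]+)/(\d+)px-")


def regen_url(thumb_url, wiki_host):
    """Build the Special:Redirect URL that asks MediaWiki for the same thumbnail.

    `wiki_host` is an assumption supplied by the caller,
    e.g. "en.wikipedia.beta.wmflabs.org".
    """
    match = THUMB_PATH_RE.search(urlsplit(thumb_url).path)
    if match is None:
        raise ValueError("not a thumbnail URL: " + thumb_url)
    name, width = match.group(1), match.group(2)
    # Keep ':' and '/' unescaped so the title stays readable.
    query = urlencode(
        {"title": "Special:Redirect/file/" + name, "width": width}, safe=":/"
    )
    return "https://" + wiki_host + "/w/index.php?" + query


THUMB = (
    "https://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/"
    "Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png"
)
print(regen_url(THUMB, "en.wikipedia.beta.wmflabs.org"))
```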

I've followed the steps mentioned by @TheresNoTime, but sadly they didn't help at all. Please note that neither Varnish nor ATS caches 503 errors.

After checking ATS in deployment-cache-upload07 it seems like the 503 is coming from deployment-ms-fe03.deployment-prep.eqiad.wmflabs:

Date:2022-10-26 Time:09:37:25 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:30 OriginServer:deployment-ms-fe03.deployment-prep.eqiad.wmflabs OriginServerTime:28 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:503 OriginStatus:503 ReqURL:http://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png BereqURL:GET http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png HTTP/1.1 ReqHeader:User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 ReqHeader:Host:upload.wikimedia.beta.wmflabs.org ReqHeader:X-Client-IP:REDACTED ReqHeader:Cookie:- RespHeader:X-Cache-Int:deployment-cache-upload07 miss RespHeader:Backend-Timing:-
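The fields that matter in a log line like the one above are `OriginServer`, `CacheResultCode`, `RespStatus` and `OriginStatus`. A minimal sketch for pulling them out follows. The function names are hypothetical, the sample line is abridged from the one above, and the "origin at fault" rule (a 5xx origin status passed through unchanged) is an assumption about how to read the log, not an ATS feature.

```python
import re

# Only the fields needed to tell a cache problem from an origin problem.
FIELD_RE = re.compile(
    r"\b(OriginServer|CacheResultCode|RespStatus|OriginStatus):(\S+)"
)


def parse_ats_fields(line):
    """Return the selected Key:Value fields of one ATS log line as a dict."""
    return dict(FIELD_RE.findall(line))


def origin_returned_error(fields):
    """True when the origin itself answered 5xx and ATS passed it through."""
    origin = fields.get("OriginStatus", "")
    return origin.startswith("5") and fields.get("RespStatus") == origin


# Abridged from the log line above.
SAMPLE = (
    "Date:2022-10-26 Time:09:37:25 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:30 "
    "OriginServer:deployment-ms-fe03.deployment-prep.eqiad.wmflabs "
    "OriginServerTime:28 CacheResultCode:TCP_MISS CacheWriteResult:- "
    "ReqMethod:GET RespStatus:503 OriginStatus:503"
)

fields = parse_ats_fields(SAMPLE)
print(fields["OriginServer"], origin_returned_error(fields))
```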

Removing Traffic since haproxy/varnish/ATS isn't at fault here.

Thanks @Vgutierrez, I'll take a closer look in a moment, but just noting from deployment-ms-be05:

Oct 26 09:49:16 deployment-ms-be05 object-server: 172.16.5.163 - - [26/Oct/2022:09:49:16 +0000] "GET /lv-a1/39814/AUTH_mw/wikipedia-en-local-thumb.13/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png" 404 - "GET http://127.0.0.1/v1/AUTH_mw/wikipedia-en-local-thumb.13/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png" "tx881dc7c7d6b640a38b6d6-006359029c" "proxy-server 13694" 0.0018 "-" 28827 0

@TheresNoTime ats-be in deployment-cache-upload07 forwards the requests to deployment-ms-fe03.deployment-prep.eqiad.wmflabs and that one is the service having issues reaching deployment-ms-be05

"If someone could purge.. that should fix it"

Even if it worked, I'd rather not do that whenever I want to see a thumbnail of something. (https://commons.wikimedia.beta.wmflabs.org/wiki/Special:NewFiles only shows existing cached thumbnails, this isn't very practical)

There are ways to mass-purge all URLs on specific domains if needed, which people with the right server access could do. But this doesn't look like a cache issue, so I don't think it would make a difference.

samtar@deployment-ms-fe03:~$ swift list wikipedia-en-local-thumb.13
1/13/Bert_Self-portrait2.png/150px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/170px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/180px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/220px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/255px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/320px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/480px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/60px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/640px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/90px-Bert_Self-portrait2.png

200px definitely isn't there in Swift, hmm.
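The same check can be scripted against saved `swift list` output. This is a sketch only: `thumb_widths` is a hypothetical helper, and it assumes object names end in `<width>px-<file name>` as in the listing above.

```python
import re

# Object names in the listing end in "/<width>px-<file name>".
WIDTH_RE = re.compile(r"/(\d+)px-[^/]+$")


def thumb_widths(listing):
    """Return the sorted thumbnail widths found in `swift list` output."""
    widths = []
    for line in listing.splitlines():
        match = WIDTH_RE.search(line.strip())
        if match:
            widths.append(int(match.group(1)))
    return sorted(widths)


# Copied from the listing above.
LISTING = """\
1/13/Bert_Self-portrait2.png/150px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/170px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/180px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/220px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/255px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/320px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/480px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/60px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/640px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/90px-Bert_Self-portrait2.png
"""

print(200 in thumb_widths(LISTING))
```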

Just noting the things I've tried (unsuccessfully):

The message has changed:

Unauthorized
This server could not verify that you are authorized to access the document you requested.

That message is returned by deployment-ms-fe03. I've captured one of my requests between ats-be on deployment-cache-upload07 and deployment-ms-fe03:

GET /wikipedia/commons/2/24/Water_tank.jpg HTTP/1.1
user-agent: curl/7.74.0
accept: */*
x-client-ip: REDACTED
x-client-port: 43796
x-forwarded-proto: https
x-connection-properties: H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;
X-Forwarded-For: REDACTED, 172.16.0.188
via-nginx: 1
Host: upload.wikimedia.beta.wmflabs.org
X-WMF-NOCOOKIES: 1
X-CDIS: pass
X-Varnish: 794354

HTTP/1.1 401 Unauthorized
Content-Length: 131
Content-Type: text/html; charset=UTF-8
Www-Authenticate: Swift realm="AUTH_mw"
Access-Control-Allow-Origin: *
X-Trans-Id: tx4152382cf15b44ad94d54-0063628c13
Date: Wed, 02 Nov 2022 15:26:13 GMT

<html><h1>Unauthorized</h1><p>This server could not verify that you are authorized to access the document you requested.</p></html>

Digging a little into the Swift logs:

Nov  2 15:40:41 deployment-ms-fe03 proxy-server: ERROR with Account server 172.16.7.115:6002/lv-a1 re: Trying to HEAD /v1/AUTH_mw: Host unreachable (txn: tx980a0ddf019b49df9f8a0-0063628f78)
Nov  2 15:40:42 deployment-ms-fe03 proxy-server: ERROR with Account server 172.16.7.114:6002/lv-a1 re: Trying to HEAD /v1/AUTH_mw: ConnectionTimeout (0.5s) (txn: tx980a0ddf019b49df9f8a0-0063628f78

those IPs belong to ms-be05 and ms-be06:

root@deployment-ms-fe03:/var/log/swift# host 172.16.7.115
115.7.16.172.in-addr.arpa domain name pointer deployment-ms-be06.deployment-prep.eqiad1.wikimedia.cloud.
root@deployment-ms-fe03:/var/log/swift# host 172.16.7.114
114.7.16.172.in-addr.arpa domain name pointer deployment-ms-be05.deployment-prep.eqiad1.wikimedia.cloud.

and actually ms-fe03 is unable to reach port 6002:

root@deployment-ms-fe03:/var/log/swift# nc -zv 172.16.7.114 6002
nc: connect to 172.16.7.114 port 6002 (tcp) failed: No route to host
root@deployment-ms-fe03:/var/log/swift# nc -zv 172.16.7.115 6002
nc: connect to 172.16.7.115 port 6002 (tcp) failed: Connection timed out
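A rough Python equivalent of the `nc -zv` probes above, for cases where the check needs to run from a script. This is a sketch: `check_port` is a hypothetical helper and the result strings are arbitrary labels, not the messages `nc` prints.

```python
import socket


def check_port(host, port, timeout=3.0):
    """Try a TCP connect and describe the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer within `timeout`, e.g. packets silently dropped.
        return "timed out"
    except ConnectionRefusedError:
        # Host is up but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Covers "No route to host", unresolvable names, and similar errors.
        return "error: " + (exc.strerror or str(exc))


if __name__ == "__main__":
    # IPs and port taken from the Swift proxy-server errors above.
    for backend in ("172.16.7.114", "172.16.7.115"):
        print(backend, check_port(backend, 6002))
```

Distinguishing "timed out" from "error: No route to host" mirrors the two different `nc` failures seen above, though both can point to an instance that is simply powered off.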

Well... deployment-ms-be05 and deployment-ms-be06 have been powered off, I'm assuming because those two are running Debian stretch 😅

I guess we should spawn deployment-ms-be07 running bullseye (Swift in production is happily running on bullseye nowadays).

Bumping this one up a bit as it's broken our testing of Phonos

@TheresNoTime this should be fixed as a side effect of powering the old instances on to be able to add the new instances to the cluster

Just confirmed that (the cause of) T322254: TypeError: Return value of MediaWiki\Extension\Phonos\Engine\Engine::isPersisted() must be of the type bool, null returned is now resolved :) thank you!

Vgutierrez claimed this task.