Steps to replicate the issue (include links if applicable):
What happens?:
503 Service Unavailable
No server is available to handle this request.
What should have happened instead?:
200
Has been like this for days.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | BUG REPORT | Vgutierrez | T321654 Thumbnails on beta cluster return 503 Service Unavailable
Resolved | | Vgutierrez | T322231 Create new deployment-ms-be instances running Debian Bullseye
If someone could purge https://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png and then visit (regen) https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:Redirect/file/Bert_Self-portrait2.png&width=200, that should fix it
nb. we had this a little while back on a random production commons image fwiw, the above resolved it that time
I've followed the steps mentioned by @TheresNoTime but sadly it didn't help at all. Please note that neither Varnish nor ATS caches 503 errors.
After checking ATS in deployment-cache-upload07 it seems like the 503 is coming from deployment-ms-fe03.deployment-prep.eqiad.wmflabs:
Date:2022-10-26 Time:09:37:25 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:30 OriginServer:deployment-ms-fe03.deployment-prep.eqiad.wmflabs OriginServerTime:28 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:503 OriginStatus:503 ReqURL:http://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png BereqURL:GET http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png HTTP/1.1 ReqHeader:User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 ReqHeader:Host:upload.wikimedia.beta.wmflabs.org ReqHeader:X-Client-IP:REDACTED ReqHeader:Cookie:- RespHeader:X-Cache-Int:deployment-cache-upload07 miss RespHeader:Backend-Timing:-
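For reference, the fields that matter in that ATS log line (OriginServer, CacheResultCode, RespStatus, OriginStatus) can be pulled out with a small script. This is only a sketch: the names `parse_ats_fields` and `origin_failed` are hypothetical, and it relies on the `Key:Value` layout seen in the line above rather than on any documented ATS log format.

```python
import re

# Fields of interest; each is assumed to be a whitespace-free "Key:Value"
# token, as in the log line quoted above.
FIELDS = ("OriginServer", "CacheResultCode", "RespStatus", "OriginStatus")


def parse_ats_fields(line):
    """Return a dict of the selected Key:Value fields found in one log line."""
    result = {}
    for name in FIELDS:
        # (?<!\S) makes sure the key starts a token; the colon right after
        # the name keeps "OriginServer" from matching "OriginServerTime".
        match = re.search(r"(?<!\S)" + re.escape(name) + r":(\S+)", line)
        if match:
            result[name] = match.group(1)
    return result


def origin_failed(fields):
    """True when a 5xx status was returned by the origin and passed through."""
    status = fields.get("OriginStatus", "")
    return status.startswith("5") and status == fields.get("RespStatus")


if __name__ == "__main__":
    sample = (
        "ConnAttempts:0 OriginServer:deployment-ms-fe03.deployment-prep.eqiad.wmflabs "
        "OriginServerTime:28 CacheResultCode:TCP_MISS RespStatus:503 OriginStatus:503"
    )
    fields = parse_ats_fields(sample)
    print(fields["OriginServer"], origin_failed(fields))
```

A TCP_MISS combined with matching RespStatus and OriginStatus values is what points at the origin (here deployment-ms-fe03) rather than at the cache layer.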
Thanks @Vgutierrez, I'll take a closer look in a moment, but just noting from deployment-ms-be05:
Oct 26 09:49:16 deployment-ms-be05 object-server: 172.16.5.163 - - [26/Oct/2022:09:49:16 +0000] "GET /lv-a1/39814/AUTH_mw/wikipedia-en-local-thumb.13/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png" 404 - "GET http://127.0.0.1/v1/AUTH_mw/wikipedia-en-local-thumb.13/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png" "tx881dc7c7d6b640a38b6d6-006359029c" "proxy-server 13694" 0.0018 "-" 28827 0
@TheresNoTime ats-be in deployment-cache-upload07 forwards the requests to deployment-ms-fe03.deployment-prep.eqiad.wmflabs and that one is the service having issues reaching deployment-ms-be05
"If someone could purge.. that should fix it"
Even if it worked, I'd rather not do that whenever I want to see a thumbnail of something. (https://commons.wikimedia.beta.wmflabs.org/wiki/Special:NewFiles only shows existing cached thumbnails, this isn't very practical)
There are ways to mass purge all URLs on specific domains if that were needed, which could be done by people with the right server access, but this doesn't seem to be an issue with the cache, so it doesn't seem like that would make a difference.
samtar@deployment-ms-fe03:~$ swift list wikipedia-en-local-thumb.13
1/13/Bert_Self-portrait2.png/150px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/170px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/180px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/220px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/255px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/320px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/480px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/60px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/640px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/90px-Bert_Self-portrait2.png
200px definitely isn't there in Swift, hmm.
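The same check can be scripted against the listing. A minimal sketch, assuming the `swift list` output has been saved as text with one object name per line; `thumb_widths` is a hypothetical helper, not part of the Swift client.

```python
import re


def thumb_widths(listing, original):
    """Return the sorted pixel widths of the thumbnails of `original`.

    `listing` is the text output of `swift list <container>`, assumed to
    hold one object name per line, e.g. "1/13/Foo.png/150px-Foo.png".
    """
    escaped = re.escape(original)
    pattern = re.compile(r"/" + escaped + r"/(\d+)px-" + escaped + r"$")
    widths = set()
    for name in listing.splitlines():
        match = pattern.search(name.strip())
        if match:
            widths.add(int(match.group(1)))
    return sorted(widths)


if __name__ == "__main__":
    listing = "\n".join(
        "1/13/Bert_Self-portrait2.png/%dpx-Bert_Self-portrait2.png" % w
        for w in (150, 170, 180, 220, 255, 320, 480, 60, 640, 90)
    )
    widths = thumb_widths(listing, "Bert_Self-portrait2.png")
    print("200px present:", 200 in widths)
```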
Just noting the things I've tried (unsuccessfully):
The message has changed:
Unauthorized
This server could not verify that you are authorized to access the document you requested.
that message is provided by deployment-ms-fe03, I've captured one of my requests between ats-be in deployment-cache-upload07 and deployment-ms-fe03:
GET /wikipedia/commons/2/24/Water_tank.jpg HTTP/1.1
user-agent: curl/7.74.0
accept: */*
x-client-ip: REDACTED
x-client-port: 43796
x-forwarded-proto: https
x-connection-properties: H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;
X-Forwarded-For: REDACTED, 172.16.0.188
via-nginx: 1
Host: upload.wikimedia.beta.wmflabs.org
X-WMF-NOCOOKIES: 1
X-CDIS: pass
X-Varnish: 794354

HTTP/1.1 401 Unauthorized
Content-Length: 131
Content-Type: text/html; charset=UTF-8
Www-Authenticate: Swift realm="AUTH_mw"
Access-Control-Allow-Origin: *
X-Trans-Id: tx4152382cf15b44ad94d54-0063628c13
Date: Wed, 02 Nov 2022 15:26:13 GMT

<html><h1>Unauthorized</h1><p>This server could not verify that you are authorized to access the document you requested.</p></html>
Digging a little into the Swift logs:
Nov 2 15:40:41 deployment-ms-fe03 proxy-server: ERROR with Account server 172.16.7.115:6002/lv-a1 re: Trying to HEAD /v1/AUTH_mw: Host unreachable (txn: tx980a0ddf019b49df9f8a0-0063628f78)
Nov 2 15:40:42 deployment-ms-fe03 proxy-server: ERROR with Account server 172.16.7.114:6002/lv-a1 re: Trying to HEAD /v1/AUTH_mw: ConnectionTimeout (0.5s) (txn: tx980a0ddf019b49df9f8a0-0063628f78
those IPs belong to ms-be05 and ms-be06:
root@deployment-ms-fe03:/var/log/swift# host 172.16.7.115
115.7.16.172.in-addr.arpa domain name pointer deployment-ms-be06.deployment-prep.eqiad1.wikimedia.cloud.
root@deployment-ms-fe03:/var/log/swift# host 172.16.7.114
114.7.16.172.in-addr.arpa domain name pointer deployment-ms-be05.deployment-prep.eqiad1.wikimedia.cloud.
and actually ms-fe03 is unable to reach port 6002:
root@deployment-ms-fe03:/var/log/swift# nc -zv 172.16.7.114 6002
nc: connect to 172.16.7.114 port 6002 (tcp) failed: No route to host
root@deployment-ms-fe03:/var/log/swift# nc -zv 172.16.7.115 6002
nc: connect to 172.16.7.115 port 6002 (tcp) failed: Connection timed out
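The same reachability test can be run without `nc`, which is handy for checking several backends in one go. A rough sketch using only the Python standard library; the function name `port_reachable` and the 2-second timeout are assumptions, and the addresses in the usage part are the ones from the Swift logs above.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return (True, "") if a TCP connection succeeds, else (False, reason)."""
    try:
        # create_connection performs the TCP handshake, much like `nc -z`.
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers timeouts, "No route to host" and "Connection refused".
        return False, str(exc) or type(exc).__name__


if __name__ == "__main__":
    for backend in ("172.16.7.114", "172.16.7.115"):
        ok, reason = port_reachable(backend, 6002)
        print(backend, "reachable" if ok else "unreachable: " + reason)
```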
Well... deployment-ms-be05 and deployment-ms-be06 have been powered off, I'm assuming because those two are running Debian Stretch 😅
I guess we should spawn deployment-ms-be07 running bullseye (swift in production is happily running in bullseye nowadays)
@TheresNoTime this should now be fixed, as a side effect of powering the old instances back on so that the new instances could be added to the cluster.
Just confirmed that (the cause of) T322254: TypeError: Return value of MediaWiki\Extension\Phonos\Engine\Engine::isPersisted() must be of the type bool, null returned is now resolved :) thank you!