Thumbnails on beta cluster return 503 Service Unavailable
Closed, Resolved · Public · BUG REPORT

Assigned To

Authored By

	AlexisJazz
	Oct 26 2022, 9:05 AM

Description

Steps to replicate the issue (include links if applicable):

https://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png

What happens?:
503 Service Unavailable
No server is available to handle this request.

What should have happened instead?:
200

Has been like this for days.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved	BUG REPORT	Vgutierrez	T321654 Thumbnails on beta cluster return 503 Service Unavailable
		Resolved		Vgutierrez	T322231 Create new deployment-ms-be instances running Debian Bullseye

Event Timeline

AlexisJazz created this task. Oct 26 2022, 9:05 AM

Restricted Application added a subscriber: Aklapper. Oct 26 2022, 9:05 AM

TheresNoTime subscribed. Oct 26 2022, 9:22 AM

Maintenance_bot added a project: SRE. Oct 26 2022, 9:29 AM

If someone could purge https://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png and then visit (to regenerate it) https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:Redirect/file/Bert_Self-portrait2.png&width=200, that should fix it.

NB: we had this a little while back on a random production Commons image; FWIW, the above resolved it that time.

I've followed the steps mentioned by @TheresNoTime, but sadly they didn't help at all. Please note that neither Varnish nor ATS caches 503 errors.

After checking ATS in deployment-cache-upload07 it seems like the 503 is coming from deployment-ms-fe03.deployment-prep.eqiad.wmflabs:

Date:2022-10-26 Time:09:37:25 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:30 OriginServer:deployment-ms-fe03.deployment-prep.eqiad.wmflabs OriginServerTime:28 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:503 OriginStatus:503 ReqURL:http://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png BereqURL:GET http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/wikipedia/en/thumb/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png HTTP/1.1 ReqHeader:User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 ReqHeader:Host:upload.wikimedia.beta.wmflabs.org ReqHeader:X-Client-IP:REDACTED ReqHeader:Cookie:- RespHeader:X-Cache-Int:deployment-cache-upload07 miss RespHeader:Backend-Timing:-
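For anyone repeating this check, here is a minimal Python sketch of how the fields that matter can be pulled out of such a line. The field names are copied from the log line above. The helper names (`parse_ats_fields`, `error_from_origin`) are hypothetical, and the rule "cache miss plus an identical 5xx from the origin means the cache only relayed the error" is my reading of this one line, not an official ATS diagnostic.

```python
import re

# Field names as they appear in the log line quoted above.
FIELDS = ("OriginServer", "CacheResultCode", "RespStatus", "OriginStatus")


def parse_ats_fields(line, fields=FIELDS):
    """Return {field: value} for each 'Field:value' token found in the line."""
    found = {}
    for name in fields:
        # Anchor on whitespace so 'OriginServer' does not match 'OriginServerTime'.
        match = re.search(r"(?:^|\s)" + re.escape(name) + r":(\S+)", line)
        if match:
            found[name] = match.group(1)
    return found


def error_from_origin(fields):
    """True when the cache missed and merely relayed a 5xx from the origin."""
    return (
        fields.get("CacheResultCode") == "TCP_MISS"
        and fields.get("OriginStatus", "").startswith("5")
        and fields.get("RespStatus") == fields.get("OriginStatus")
    )


# Shortened copy of the log line above (request/response headers left out).
sample = (
    "Date:2022-10-26 Time:09:37:25 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:30 "
    "OriginServer:deployment-ms-fe03.deployment-prep.eqiad.wmflabs OriginServerTime:28 "
    "CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET "
    "RespStatus:503 OriginStatus:503"
)

fields = parse_ats_fields(sample)
print(fields["OriginServer"], error_from_origin(fields))
```

Only space-free fields are handled; values such as the User-Agent header contain spaces and would need a different approach.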

Removing Traffic since haproxy/varnish/ATS isn't at fault here.

Thanks @Vgutierrez, I'll take a closer look in a moment, but just noting from deployment-ms-be05:

Oct 26 09:49:16 deployment-ms-be05 object-server: 172.16.5.163 - - [26/Oct/2022:09:49:16 +0000] "GET /lv-a1/39814/AUTH_mw/wikipedia-en-local-thumb.13/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png" 404 - "GET http://127.0.0.1/v1/AUTH_mw/wikipedia-en-local-thumb.13/1/13/Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png" "tx881dc7c7d6b640a38b6d6-006359029c" "proxy-server 13694" 0.0018 "-" 28827 0
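Comparing the public URL with the object-server line above suggests how a thumbnail URL maps onto a Swift container and object name. The sketch below encodes only the pattern visible in these two log lines. The function name `thumb_url_to_swift` is hypothetical, and the real rewrite rules in the Swift proxy may differ for other wikis (for example, containers that are not sharded).

```python
from urllib.parse import urlsplit


def thumb_url_to_swift(url):
    """Map an upload thumbnail URL to (container, object), per the logs above."""
    parts = urlsplit(url).path.strip("/").split("/")
    # Expected shape: <project>/<lang>/thumb/<h1>/<h2>/<file>/<thumbname>
    if len(parts) != 7 or parts[2] != "thumb":
        raise ValueError(f"not a thumbnail path: {url}")
    project, lang, _, h1, h2, filename, thumbname = parts
    # Assumption: the container suffix is the two-character hash directory.
    container = f"{project}-{lang}-local-thumb.{h2}"
    obj = "/".join((h1, h2, filename, thumbname))
    return container, obj


url = (
    "https://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/"
    "Bert_Self-portrait2.png/200px-Bert_Self-portrait2.png"
)
print(thumb_url_to_swift(url))
```

For the URL in this task, this yields the container and object path that appear in the 404 line above.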

@TheresNoTime ats-be in deployment-cache-upload07 forwards the requests to deployment-ms-fe03.deployment-prep.eqiad.wmflabs and that one is the service having issues reaching deployment-ms-be05

"If someone could purge.. that should fix it"

Even if it worked, I'd rather not do that whenever I want to see a thumbnail of something. (https://commons.wikimedia.beta.wmflabs.org/wiki/Special:NewFiles only shows existing cached thumbnails, this isn't very practical)

In T321654#8344934, @AlexisJazz wrote:

"If someone could purge.. that should fix it"

Even if it worked, I'd rather not do that whenever I want to see a thumbnail of something. (https://commons.wikimedia.beta.wmflabs.org/wiki/Special:NewFiles only shows existing cached thumbnails, this isn't very practical)

There are ways to mass-purge all URLs on specific domains if that were needed, and people with the right server access could do so. However, this doesn't look like a cache issue, so a purge probably wouldn't make a difference.

samtar@deployment-ms-fe03:~$ swift list wikipedia-en-local-thumb.13
1/13/Bert_Self-portrait2.png/150px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/170px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/180px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/220px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/255px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/320px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/480px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/60px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/640px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/90px-Bert_Self-portrait2.png

200px definitely isn't there in Swift, hmm.
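The same check can be scripted rather than eyeballed. This is a small Python sketch that takes the `swift list` output above as text and reports which thumbnail widths exist. The function name `thumb_widths` is hypothetical, and it assumes object names always end in `<width>px-<filename>`, as they do in this listing.

```python
import re


def thumb_widths(listing, filename):
    """Return the sorted pixel widths present for `filename` in `swift list` output."""
    pattern = re.compile(r"/(\d+)px-" + re.escape(filename) + r"$")
    widths = set()
    for line in listing.splitlines():
        match = pattern.search(line.strip())
        if match:
            widths.add(int(match.group(1)))
    return sorted(widths)


# The listing quoted above, verbatim.
listing = """\
1/13/Bert_Self-portrait2.png/150px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/170px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/180px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/220px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/255px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/320px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/480px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/60px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/640px-Bert_Self-portrait2.png
1/13/Bert_Self-portrait2.png/90px-Bert_Self-portrait2.png
"""

widths = thumb_widths(listing, "Bert_Self-portrait2.png")
print(widths)          # [60, 90, 150, 170, 180, 220, 255, 320, 480, 640]
print(200 in widths)   # False
```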

Just noting the things I've tried (unsuccessfully):

Purging the varnish cache for this file
Deleting some generated thumbnails for this file (as detailed in https://wikitech.wikimedia.org/wiki/Swift/How_To)

jbond edited projects, added serviceops; removed SRE. Nov 2 2022, 11:41 AM

The message has changed:

Unauthorized
This server could not verify that you are authorized to access the document you requested.

That message is returned by deployment-ms-fe03. I've captured one of my requests between ats-be on deployment-cache-upload07 and deployment-ms-fe03:

GET /wikipedia/commons/2/24/Water_tank.jpg HTTP/1.1
user-agent: curl/7.74.0
accept: */*
x-client-ip: REDACTED
x-client-port: 43796
x-forwarded-proto: https
x-connection-properties: H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;
X-Forwarded-For: REDACTED, 172.16.0.188
via-nginx: 1
Host: upload.wikimedia.beta.wmflabs.org
X-WMF-NOCOOKIES: 1
X-CDIS: pass
X-Varnish: 794354

HTTP/1.1 401 Unauthorized
Content-Length: 131
Content-Type: text/html; charset=UTF-8
Www-Authenticate: Swift realm="AUTH_mw"
Access-Control-Allow-Origin: *
X-Trans-Id: tx4152382cf15b44ad94d54-0063628c13
Date: Wed, 02 Nov 2022 15:26:13 GMT

<html><h1>Unauthorized</h1><p>This server could not verify that you are authorized to access the document you requested.</p></html>

Digging a little into the Swift logs:

Nov  2 15:40:41 deployment-ms-fe03 proxy-server: ERROR with Account server 172.16.7.115:6002/lv-a1 re: Trying to HEAD /v1/AUTH_mw: Host unreachable (txn: tx980a0ddf019b49df9f8a0-0063628f78)
Nov  2 15:40:42 deployment-ms-fe03 proxy-server: ERROR with Account server 172.16.7.114:6002/lv-a1 re: Trying to HEAD /v1/AUTH_mw: ConnectionTimeout (0.5s) (txn: tx980a0ddf019b49df9f8a0-0063628f78
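To list every backend the proxy is complaining about, error lines like the two above can be parsed mechanically. This is a sketch only: the regular expression is fitted to these two sample lines, not to a documented Swift log format, and `unreachable_backends` is a hypothetical helper name.

```python
import re

# Pattern fitted to the two proxy-server lines quoted above (an assumption,
# not an official format description).
ERROR_RE = re.compile(
    r"ERROR with (?P<service>\w+) server (?P<ip>[\d.]+):(?P<port>\d+)/(?P<device>\S+) "
    r"re: Trying to (?P<method>[A-Z]+) (?P<path>\S+): (?P<reason>.+?)(?: \(txn: |$)"
)


def unreachable_backends(lines):
    """Map (ip, port) to the last error reason reported for that backend."""
    backends = {}
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            key = (match.group("ip"), int(match.group("port")))
            backends[key] = match.group("reason")
    return backends


log_lines = [
    "Nov  2 15:40:41 deployment-ms-fe03 proxy-server: ERROR with Account server "
    "172.16.7.115:6002/lv-a1 re: Trying to HEAD /v1/AUTH_mw: Host unreachable "
    "(txn: tx980a0ddf019b49df9f8a0-0063628f78)",
    "Nov  2 15:40:42 deployment-ms-fe03 proxy-server: ERROR with Account server "
    "172.16.7.114:6002/lv-a1 re: Trying to HEAD /v1/AUTH_mw: ConnectionTimeout (0.5s) "
    "(txn: tx980a0ddf019b49df9f8a0-0063628f78",
]

for (ip, port), reason in unreachable_backends(log_lines).items():
    print(ip, port, reason)
```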

those IPs belong to ms-be05 and ms-be06:

root@deployment-ms-fe03:/var/log/swift# host 172.16.7.115
115.7.16.172.in-addr.arpa domain name pointer deployment-ms-be06.deployment-prep.eqiad1.wikimedia.cloud.
root@deployment-ms-fe03:/var/log/swift# host 172.16.7.114
114.7.16.172.in-addr.arpa domain name pointer deployment-ms-be05.deployment-prep.eqiad1.wikimedia.cloud.

and actually ms-fe03 is unable to reach port 6002:

root@deployment-ms-fe03:/var/log/swift# nc -zv 172.16.7.114 6002
nc: connect to 172.16.7.114 port 6002 (tcp) failed: No route to host
root@deployment-ms-fe03:/var/log/swift# nc -zv 172.16.7.115 6002
nc: connect to 172.16.7.115 port 6002 (tcp) failed: Connection timed out
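Where `nc` is not installed, roughly the same probe can be done with the Python standard library. This is a minimal sketch, not a replacement for `nc -zv`. The function name `port_reachable` and the 3-second default timeout are my own choices.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return (True, None) if a TCP connection succeeds, else (False, reason)."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (including timeouts and "No route to host") on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, str(exc) or type(exc).__name__


# Example (run from deployment-ms-fe03), using the addresses from the logs above:
#   for ip in ("172.16.7.114", "172.16.7.115"):
#       print(ip, port_reachable(ip, 6002))
```

The exact failure text depends on the operating system, so only the boolean should be relied on in scripts.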

Well... deployment-ms-be05 and deployment-ms-be06 have been powered off. I'm assuming that's because those two are running Debian Stretch 😅

I guess we should spawn deployment-ms-be07 running Bullseye (Swift in production is happily running on Bullseye nowadays).

Vgutierrez changed the status of subtask T322231: Create new deployment-ms-be instances running Debian Bullseye from Open to In Progress. Nov 2 2022, 5:11 PM

Bumping this one up a bit, as it's broken our testing of Phonos.

TheresNoTime mentioned this in T322254: TypeError: Return value of MediaWiki\Extension\Phonos\Engine\Engine::isPersisted() must be of the type bool, null returned. Nov 2 2022, 6:53 PM

@TheresNoTime this should be fixed as a side effect of powering the old instances on to be able to add the new instances to the cluster

In T321654#8364815, @Vgutierrez wrote:

@TheresNoTime this should be fixed as a side effect of powering the old instances on to be able to add the new instances to the cluster

Just confirmed that (the cause of) T322254: TypeError: Return value of MediaWiki\Extension\Phonos\Engine\Engine::isPersisted() must be of the type bool, null returned is now resolved :) thank you!

Vgutierrez closed this task as Resolved. Nov 2 2022, 8:29 PM

Vgutierrez claimed this task.

Vgutierrez changed the status of subtask T322231: Create new deployment-ms-be instances running Debian Bullseye from In Progress to Stalled. Nov 3 2022, 11:56 AM

Vgutierrez closed subtask T322231: Create new deployment-ms-be instances running Debian Bullseye as Resolved. Nov 8 2022, 3:26 PM

RhinosF1 mentioned this in T323152: Thumbnails not appearing in search on the beta cluster. Nov 16 2022, 6:08 AM