Uploads fail due to 401 error from swift on wednesdays
Closed, Resolved, Public

Description

I recently noticed in the logs that an upload failed due to a 401 error from Swift.

Log entry (req id 98f9e25c-9b70-478a-86cc-d4fd52680b90):

Feb 29, 2024 @ 16:03:07 HTTP 401 (Unauthorized) in 'SwiftFileBackend::doGetFileStatMulti' (given '{"srcs":["mwstore://local-swift-codfw/local-temp/7/72/20240229155755!chunkedupload_cb23bd647e85.pdf"],"concurrency":1}')
Feb 29, 2024 @ 16:03:07 HTTP 401 (Unauthorized) in 'SwiftFileBackend::doGetFileStatMulti' (given '{"srcs":["mwstore://local-swift-codfw/local-temp/7/72/20240229155755!chunkedupload_cb23bd647e85.pdf"],"concurrency":50}')
Feb 29, 2024 @ 16:03:07 HTTP 401 (Unauthorized) in 'SwiftFileBackend::doGetFileStatMulti' (given '{"srcs":["mwstore://local-swift-codfw/local-public/archive/6/64/20240229160249!\u65b0\u9078\u5404\u540d\u516c\u91d1\u7389\u5c0d\u806f.pdf"],"concurrency":1}')
Feb 29, 2024 @ 16:03:07 HTTP 401 (Unauthorized) in 'SwiftFileBackend::doGetFileStatMulti' (given '{"srcs":["mwstore://local-swift-codfw/local-public/archive/6/64/20240229160249!\u65b0\u9078\u5404\u540d\u516c\u91d1\u7389\u5c0d\u806f.pdf"],"concurrency":50}')

The fact that it happened multiple times in the same request suggests that MW might have been reusing bad credentials.

This seems pretty surprising. Long term, I'd like the publish job to be retried automatically on failure, which would make errors like this a non-issue if they are transient. Nonetheless, I'm filing this because it sounds like something that probably shouldn't happen.

Looking at broader logs (https://logstash.wikimedia.org/goto/0fabf2fcc66a2fa897a042de4cb2489b), it seems like this issue happens on Wednesdays. Maybe it is connected to the deploy somehow?

Event Timeline

Bawolff renamed this task from "Uploads fail due to 401 error from swift" to "Uploads fail due to 401 error from swift on wednesdays". Mar 1 2024, 4:07 AM
Bawolff updated the task description.

There is some information about the 401 status in T228292#7490101.
There is also T206252: Spike of HTTP errors from SwiftFileBackend::doStoreInternal, which could be related, although it involves a different function.

The tempauth expiry time is 7 days. MW considers the token to be expired after 7.5 minutes of caching, but Swift just gives it the same token every time, with a shorter and shorter remaining lifetime. Every server gets the same token -- it only varies by user, so the whole of MediaWiki gets a single token. Every 7 days, the token expires, and there is a burst of errors.

The current token in eqiad expires Wednesday 2024-03-13 14:47:20. The codfw token will expire the next day at 16:03:02.
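
To make the timing concrete, here is a rough model of the interaction described above, using only the two durations given (7 days and 7.5 minutes) and the eqiad expiry time. The function name `stale_window` is hypothetical, and the model assumes that a fresh token is issued right at expiry; it is a sketch of the arithmetic, not of the actual MediaWiki or Swift code.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(days=7)             # tempauth expiry, per the comment above
CACHE_TTL = timedelta(minutes=7, seconds=30)   # MediaWiki-side cache lifetime, per the comment above


def stale_window(token_expiry, cached_at):
    """Return (start, end) of the period in which a cached token has already
    expired on the Swift side, or None if the cache entry lapses first.

    Hypothetical helper for illustration only.
    """
    cache_end = cached_at + CACHE_TTL
    if cache_end <= token_expiry:
        return None
    return (token_expiry, cache_end)


eqiad_expiry = datetime(2024, 3, 13, 14, 47, 20)

# Token cached one minute before it expires: 401s are possible until the cache entry lapses.
window = stale_window(eqiad_expiry, eqiad_expiry - timedelta(minutes=1))
print(window[1] - window[0])

# Token cached ten minutes before expiry: the cache entry lapses first, so no stale use.
print(stale_window(eqiad_expiry, eqiad_expiry - timedelta(minutes=10)))

# Assuming a new token is issued right at expiry, the next expiry falls on the same weekday.
print((eqiad_expiry + TOKEN_LIFETIME).strftime("%A"))
```

Under these assumptions the error burst lasts at most one cache lifetime after each expiry, and because the token lifetime is exactly one week, the bursts keep recurring on the same weekday.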

Change 1010344 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] filebackend: Retry Swift requests with new auth token on 401

https://gerrit.wikimedia.org/r/1010344
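
The idea in the patch title (retry with a new auth token on 401) can be sketched roughly as follows. This is not the code from the change itself, which is PHP in mediawiki/core; `TokenCache`, `request_with_reauth`, and the fake backend are hypothetical names used purely for illustration.

```python
from typing import Callable, Optional


class TokenCache:
    """Hypothetical stand-in for a cached auth token."""

    def __init__(self, fetch_token: Callable[[], str]):
        self._fetch_token = fetch_token
        self._token: Optional[str] = None

    def get(self) -> str:
        # Authenticate only when nothing is cached.
        if self._token is None:
            self._token = self._fetch_token()
        return self._token

    def invalidate(self) -> None:
        self._token = None


def request_with_reauth(send: Callable[[str], int], cache: TokenCache) -> int:
    """Send with the cached token; on a 401, drop the token,
    re-authenticate, and retry exactly once."""
    status = send(cache.get())
    if status == 401:
        cache.invalidate()
        status = send(cache.get())
    return status


# Fake auth service: the first token issued has expired by the time it is used.
issued = iter(["expired-token", "fresh-token"])
cache = TokenCache(lambda: next(issued))
cache.get()  # warm the cache with the token that is about to expire

# Fake backend: accepts only the fresh token.
status = request_with_reauth(lambda tok: 200 if tok == "fresh-token" else 401, cache)
print(status)
```

Retrying only once keeps a genuine credentials problem from turning into a retry loop: if the second attempt also returns 401, the error is passed back to the caller.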

Krinkle subscribed.

Tagging MwEng group for visibility, given unowned code.

Change 1010344 merged by jenkins-bot:

[mediawiki/core@master] filebackend: Retry Swift requests with new auth token on 401

https://gerrit.wikimedia.org/r/1010344