
API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal
Open, Medium, Public, Production Error

Description

Error

Request URL:
Request ID: INSERT_ID

message
UploadChunkFileException: Error storing file in '/tmp/phpYHAPWZ': backend-fail-internal; local-swift-codfw
trace
#0 /srv/mediawiki/php-1.34.0-wmf.14/includes/upload/UploadFromChunks.php(275): UploadFromChunks->outputChunk(string)
#1 /srv/mediawiki/php-1.34.0-wmf.14/includes/api/ApiUpload.php(226): UploadFromChunks->addChunk(string, integer, integer)
#2 /srv/mediawiki/php-1.34.0-wmf.14/includes/api/ApiUpload.php(132): ApiUpload->getChunkResult(array)
#3 /srv/mediawiki/php-1.34.0-wmf.14/includes/api/ApiUpload.php(104): ApiUpload->getContextResult()
#4 /srv/mediawiki/php-1.34.0-wmf.14/includes/api/ApiMain.php(1583): ApiUpload->execute()
#5 /srv/mediawiki/php-1.34.0-wmf.14/includes/api/ApiMain.php(531): ApiMain->executeAction()
#6 /srv/mediawiki/php-1.34.0-wmf.14/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#7 /srv/mediawiki/php-1.34.0-wmf.14/api.php(86): ApiMain->execute()
#8 /srv/mediawiki/w/api.php(3): require(string)
#9 {main}

Impact

Unknown. Special:NewFiles still shows new files being uploaded, so at least it’s not preventing all uploads.

Notes

From logstash:

  • New in 1.34-wmf.14.
  • Affects commons.wikimedia.org (naturally).
  • Seen several dozen times already in the short time it's been out.

Event Timeline

LarsWirzenius triaged this task as Unbreak Now! priority. Jul 17 2019, 3:55 PM
Cparle added a subscriber: fgiunchedi.
Cparle added a subscriber: Cparle.

@fgiunchedi I tagged you cos @Gilles is away and I dunno who else to ask about swift ...

:D afaik we've been working almost exclusively on js/ui stuff lately, so I don't think it's us

Adding SRE per SRE-swift-storage / @fgiunchedi

(There's no tag for the Infrastructure Foundations subteam of SRE, is there?)

fgiunchedi lowered the priority of this task from Unbreak Now! to Medium. Jul 18 2019, 8:29 AM

The errors from UploadChunkFileException: https://logstash.wikimedia.org/goto/ce40b31903aa613bce0ec93c9934e5f4

Searching for local-swift-codfw on the same time period: https://logstash.wikimedia.org/goto/fe3047f68c4b368079d6845b0dfccbe9

There are a bunch of errors of this form over four minutes:

2019-07-17T14:55:39	mw1230	ERROR	HTTP 401 (Unauthorized) in 'SwiftFileBackend::doStoreInternal' (given '{"async":false,"op":"store","src":"/tmp/phpYHAPWZ","dst":"mwstore://local-swift-codfw/local-temp/d/d3/16rd13foxoo4.7etxt1.2927633.jpg.1","headers":[],"overwrite":true}')

I believe these are due to MW's authentication token for Swift expiring. I'm not sure whether there's logic to retry and refresh the auth in cases like this, though. I doubt it is a newly introduced bug, so I'm boldly setting the priority to normal; not a train blocker IMHO.

@fgiunchedi If this is not blocking the train, please remove the train task from parent tasks.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:06 PM

This is a 1y+ production error still waiting to be investigated. There is some reason to suspect it might be infrastructure related, but before SRE can help here, we first need to better understand and quantify what goes wrong in Swift (if indeed that's the case).

Error
normalized_message
[{reqId}] {exception_url}   UploadChunkFileException: Error storing file in '/tmp/phpsqQtdL': backend-fail-internal; local-swift-codfw
exception.trace
from /srv/mediawiki/php-1.38.0-wmf.7/includes/upload/UploadFromChunks.php(358)
#0 /srv/mediawiki/php-1.38.0-wmf.7/includes/upload/UploadFromChunks.php(247): UploadFromChunks->outputChunk(string)
#1 /srv/mediawiki/php-1.38.0-wmf.7/includes/api/ApiUpload.php(274): UploadFromChunks->addChunk(string, integer, integer)
#2 /srv/mediawiki/php-1.38.0-wmf.7/includes/api/ApiUpload.php(153): ApiUpload->getChunkResult(array)
#3 /srv/mediawiki/php-1.38.0-wmf.7/includes/api/ApiUpload.php(124): ApiUpload->getContextResult()
#4 /srv/mediawiki/php-1.38.0-wmf.7/includes/api/ApiMain.php(1888): ApiUpload->execute()
#5 /srv/mediawiki/php-1.38.0-wmf.7/includes/api/ApiMain.php(867): ApiMain->executeAction()
#6 /srv/mediawiki/php-1.38.0-wmf.7/includes/api/ApiMain.php(838): ApiMain->executeActionWithErrorHandling()
#7 /srv/mediawiki/php-1.38.0-wmf.7/api.php(90): ApiMain->execute()
#8 /srv/mediawiki/php-1.38.0-wmf.7/api.php(45): wfApiMain()
#9 /srv/mediawiki/w/api.php(3): require(string)
#10 {main}
Notes
  • Still happening
  • Averaging about once per day, but happens in bursts

MediaWiki does not retry requests on 401s; it just invalidates the cached token so that the next request fetches a new one. I guess it's expecting the client to retry? I don't think that's good practice, though. There's some complicated-looking logic to set the expiry of the cached token lower than Swift's expiry; I don't really understand why it doesn't just rely on the X-Auth-Token-Expires value provided by Swift. I do see one easy logic bug, though: $reAuth is ignored and the cache is still hit.
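
To make the two points above concrete, here is a minimal sketch of a token cache in which a re-auth request really does bypass the cache, and the cache lifetime comes straight from the value the backend reports. It is Python for illustration only, not MediaWiki's code: `TokenCache`, `fetch_token`, and the `reauth` flag are hypothetical names, and it assumes X-Auth-Token-Expires carries the number of seconds until expiry.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Caches an auth token together with its absolute expiry time."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # fetch_token is a hypothetical callable that authenticates and
        # returns (token, seconds_until_expiry), e.g. as parsed from the
        # X-Auth-Token and X-Auth-Token-Expires response headers.
        self._fetch_token = fetch_token
        self._clock = clock
        self._entry: Optional[Tuple[str, float]] = None
        self.fetch_count = 0

    def get(self, reauth: bool = False) -> str:
        now = self._clock()
        # Only serve from the cache when the caller did NOT ask for
        # re-authentication and the cached token has not expired yet.
        if not reauth and self._entry is not None and now < self._entry[1]:
            return self._entry[0]
        token, ttl = self._fetch_token()
        self.fetch_count += 1
        self._entry = (token, now + ttl)
        return token
```

The key line is the `not reauth` check: if the flag is dropped before the cache lookup, a caller that has just seen a 401 gets the same stale token back.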

Also, the token is stored in APCu, which is local to each server, so when it does expire there's going to be a storm of failures as every API server has to invalidate its own copy individually (note: it looks like it expires after 7 days, so idk why it would happen every day).
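
A toy simulation of why per-server caching turns one expiry into a burst, assuming each server invalidates its cached token only after one of its own requests has failed and that the failed request is not retried. The function name and parameters are hypothetical, and the shared-cache case is there purely as a contrast, not as a proposal.

```python
from typing import List, Optional


def count_failed_requests(
    num_servers: int, requests_per_server: int, shared_cache: bool
) -> int:
    """Counts requests that fail with a stale token right after it expires."""
    stale, fresh = "old-token", "new-token"
    # One cache slot per server, or a single slot when the cache is shared.
    slots = 1 if shared_cache else num_servers
    cache: List[Optional[str]] = [stale] * slots
    failures = 0
    for _ in range(requests_per_server):
        for server in range(num_servers):
            slot = 0 if shared_cache else server
            if cache[slot] is None:
                cache[slot] = fresh  # cache miss: re-authenticate
            if cache[slot] == stale:
                failures += 1  # backend answers 401
                cache[slot] = None  # invalidate; this request is lost
    return failures
```

With 5 servers and 3 requests each, the per-server case loses 5 requests (one per server), while the shared case loses 1.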

Based on reading https://docs.openstack.org/swift/latest/overview_auth.html#overview and manual experimentation, it's not possible to get a new token until the old one expires. So really, just implementing retries is the way to go.
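
A rough sketch of what such a retry could look like, again in Python rather than MediaWiki's PHP. `store_with_reauth`, `send`, and `get_token` are hypothetical names: `send(token)` is assumed to perform the request and return the HTTP status, and `get_token(reauth)` is assumed to return a cached token unless `reauth` is true.

```python
from typing import Callable


def store_with_reauth(
    send: Callable[[str], int],
    get_token: Callable[[bool], str],
    max_attempts: int = 2,
) -> int:
    """Sends a request, re-authenticating and retrying when it gets a 401."""
    status = 0
    for attempt in range(max_attempts):
        # Force re-authentication on every attempt after the first one.
        token = get_token(attempt > 0)
        status = send(token)
        if status != 401:
            break
    return status
```

Keeping `max_attempts` small matters here: a 401 caused by bad credentials rather than an expired token will never succeed, so the caller still has to handle a final 401.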