
Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw
Open, Needs Triage · Public · PRODUCTION ERROR

Description

I get

00024: FAILED: internal_api_error_UploadChunkFileException: [9e95f0c8-9cd8-4daf-93bf-996b77705f13] Caught exception of type UploadChunkFileException

while uploading a new version of https://commons.wikimedia.org/wiki/File:The_R%C3%A1m%C3%A1yana_of_Tulsi_D%C3%A1s.djvu
using https://commons.wikimedia.org/wiki/User:Rillke/bigChunkedUpload.js

from /srv/mediawiki/php-1.40.0-wmf.21/includes/upload/UploadFromChunks.php(359)
#0 /srv/mediawiki/php-1.40.0-wmf.21/includes/upload/UploadFromChunks.php(248): UploadFromChunks->outputChunk(string)
#1 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiUpload.php(278): UploadFromChunks->addChunk(string, integer, integer)
#2 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiUpload.php(157): ApiUpload->getChunkResult(array)
#3 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiUpload.php(128): ApiUpload->getContextResult()
#4 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiMain.php(1901): ApiUpload->execute()
#5 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiMain.php(878): ApiMain->executeAction()
#6 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiMain.php(849): ApiMain->executeActionWithErrorHandling()
#7 /srv/mediawiki/php-1.40.0-wmf.21/api.php(90): ApiMain->execute()
#8 /srv/mediawiki/php-1.40.0-wmf.21/api.php(45): wfApiMain()
#9 /srv/mediawiki/w/api.php(3): require(string)
#10 {main}

Details

MediaWiki Version: 1.40.0-wmf.21
Request URL: https://commons.wikimedia.org/wiki/File:The_R%C3%A1m%C3%A1yana_of_Tulsi_D%C3%A1s.djvu

Event Timeline


Can you explain the impact of this bug more clearly? Is this a somewhat random error that just requires repeating the upload? Is chunked upload deterministically broken for a given file? Is there some kind of permanent damage (such as that title left in a broken state)? What fraction of uploads is affected? Does this cause a significant time loss or bandwidth cost?

It appears to be somewhat random; however, it throws an exception that breaks my script, and I have to keep restarting it. So yes, it is a significant time loss.

Can you explain the impact of this bug more clearly? Is this a somewhat random error that just requires repeating the upload? Is chunked upload deterministically broken for a given file? Is there some kind of permanent damage (such as that title left in a broken state)? What fraction of uploads is affected? Does this cause a significant time loss or bandwidth cost?

Hi, thank you all for looking into this. It is quite random; statistically it appears after 10 to 40 chunks, so any file bigger than 4 × 40 = 160 MB is sure to fail. The upload has to be started again, so yes, it is a significant time and bandwidth loss. Surprisingly, https://commons.wikimedia.org/wiki/File:Index_librorum_prohibitorum_Leonis_XIII_sum_pont_auctoritate_recognitus_SSmi,_1922.pdf came through after several attempts.

I managed to upload one file by increasing the chunk size to 9 MB, which decreases the number of chunks. I also managed to recover 2 files from Special:UploadStash.
This bug would not be so bad if it were possible to start again from the chunk that failed.
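
For illustration, here is a minimal sketch of what resuming could look like against MediaWiki's chunked-upload API (action=upload with stash=1, filesize, offset and filekey). The authenticated requests session and CSRF token are placeholders and the helper itself is hypothetical, not part of bigChunkedUpload.js; the point is that a failed chunk can be retried from its own offset instead of restarting the whole file:

```python
import os
import time
import requests

API = "https://commons.wikimedia.org/w/api.php"
CHUNK_SIZE = 8 * 1024 * 1024  # larger chunks = fewer requests that can fail

def upload_chunked(session, csrf_token, path, filename, max_retries=5):
    """Upload `path` in chunks, retrying each chunk from its own offset.

    `session` is an already-authenticated requests.Session and `csrf_token`
    a valid CSRF token (both assumed here); only the offset/filekey handling
    is the point of this sketch.
    """
    filesize = os.path.getsize(path)
    filekey, offset = None, 0
    with open(path, "rb") as f:
        while offset < filesize:
            f.seek(offset)
            chunk = f.read(CHUNK_SIZE)
            params = {
                "action": "upload", "format": "json", "stash": 1,
                "filename": filename, "filesize": filesize,
                "offset": offset, "token": csrf_token,
            }
            if filekey:
                params["filekey"] = filekey
            for attempt in range(max_retries):
                r = session.post(API, data=params,
                                 files={"chunk": (filename, chunk)}).json()
                if "error" not in r:
                    break
                # Server-side failures (stashfailed, UploadChunkFileException)
                # are retried from the same offset after a short backoff.
                time.sleep(2 ** attempt)
            else:
                raise RuntimeError(f"chunk at offset {offset} failed: {r}")
            upload = r["upload"]
            filekey = upload["filekey"]
            offset = upload.get("offset", offset + len(chunk))
    # A final action=upload call with the filekey (and no stash) publishes the file.
    return filekey
```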

I tried again with https://commons.wikimedia.org/wiki/File:Gandhi_-_Young_India,_v._4,_1922.pdf (1150 MB), increasing the chunk size to 8 MB (which worked for medium-sized files). It failed at chunk 41:

01124: 41/141> upload in progress Upload: 100%
01218: FAILED: internal_api_error_UploadChunkFileException: [ebf1c1e8-0188-42f7-8c94-68b816a9c2e2] Caught exception of type UploadChunkFileException

It appears to be somewhat random; however, it throws an exception that breaks my script, and I have to keep restarting it. So yes, it is a significant time loss.

Is that a pywikipedia script? Seems like it might be easier to fix the script, in the short term.

This bug would not be so bad if it were possible to start again from the chunk that failed.

The error handling logic for the script is here and the configuration here. It does retry failed chunks, up to 250 times. I have a hard time believing the average upload would error out more than 250 times, so maybe there is a bug in the retry logic?

No idea how to test it though. Uploading multi-gigabyte test files to Commons doesn't seem like a great idea.

Don't know if it helps, but I just got this error with CropTool on my first attempt at creating https://commons.wikimedia.org/wiki/File:Aurora_Mayor_and_council_members_visit_Buckley_AFB_(4734201)_(cropped).jpeg

I tried again a few seconds later and it worked.

I am a bit concerned about the small rise in 50x codes reported on the ATS - backend dashboard since about 13:00:00 on 2023-03-09, but that doesn't match your time at all, and I can't find any relevant change that happened then; it is seen in both codfw and eqiad, too. Grepping proxy-access.log suggests that nearly all of the 50x codes from swift (at least today's) relate to thumbnails, which makes me wonder if thumbor is unhappy again.

For reference, these are all GET or HEAD, so are irrelevant to this issue.

This bug would not be so bad if it were possible to start again from the chunk that failed.

The error handling logic for the script is here and the configuration here. It does retry failed chunks, up to 250 times. I have a hard time believing the average upload would error out more than 250 times, so maybe there is a bug in the retry logic?

No idea how to test it though. Uploading multi-gigabyte test files to Commons doesn't seem like a great idea.

Well, the code works fine if there is a connection issue on my side. In that case, it retries until the connection comes back.
But here the issue is different: the error comes from the server side, and the process hangs with the error message.

Not sure if it's related, but I keep getting this error when uploading new versions of files.

image.png (465×1 px, 67 KB)
It might be a generic issue, not a chunked-upload one? (Also see my earlier comment about this happening when uploading images through Lightroom; I'm not sure that uses chunked upload either.)

Logs show that the issue started at 2023-03-07 14:23:35 and it fixed itself shortly after 2023-03-13 17:08:35.

UploadChunkFileException frequency.png (615×1 px, 110 KB)

SAL entry associated with the end of the incident:

17:16 mvernon@cumin1001: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)

There are more details about the errors in the FileOperation channel, for example reqId 6e7c6a99-03f7-4f7b-9bb2-c3e3da0a76bc

HTTP 502 (Bad Gateway) in 'SwiftFileBackend::doStoreInternal' (given '{"async":false,"src":"/tmp/phpJAdHEm","dst":"mwstore://local-swift-codfw/local-temp/d/dc/19wxlydr879g.mkokap.6179471.webm.4","overwrite":true,"headers":[]}')

98% of the FileOperation log entries with a "code" field were 502. Most 502 errors did not lead to an UploadChunkFileException. The number of FileOperation 502 errors per day:

2023-03-06: 1
2023-03-07: 2255
2023-03-08: 5665
2023-03-09: 22678
2023-03-10: 38292
2023-03-11: 46887
2023-03-12: 44481
2023-03-13: 32649
2023-03-14: 582

On the Swift dashboard you can see these errors grouped under "proxy-server errors" (swift_proxy_server_errors_total), although there is a background rate of errors in that metric which do not correspond to MediaWiki FileOperation log entries.

proxy-server errors.png (233×1 px, 60 KB)

Breaking down swift_proxy_server_errors_total by instance, you can see that ms-fe2009 was at fault. There is no obvious explanation in its server.log, and its proxy-access.log doesn't contain enough 502 errors to account for them.
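
For reference, a per-instance breakdown like this can be reproduced against the Prometheus HTTP API; a small sketch, in which the Prometheus URL and time window are placeholders rather than the production endpoint:

```python
import requests

# Placeholder endpoint; the real Prometheus host for the swift metrics
# is not named in this task.
PROMETHEUS = "http://prometheus.example.org:9090"

def errors_by_instance(window="1h"):
    """Return swift proxy-server error rates grouped by instance."""
    query = f"sum by (instance) (rate(swift_proxy_server_errors_total[{window}]))"
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"].get("instance", "?"): float(r["value"][1]) for r in results}

if __name__ == "__main__":
    for instance, rate in sorted(errors_by_instance().items(), key=lambda kv: -kv[1]):
        print(f"{instance}\t{rate:.3f} errors/s")
```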

Change 899885 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Unprovision the "swift" dashboard

https://gerrit.wikimedia.org/r/899885

I think this issue is resolved now? Swift error rates have remained low since we restarted swift and thumbor.

@MatthewVernon I think so. Although I guess it might be worth evaluating whether it is possible to make incidents like this more visible and detectable?

Hi, This seems to work OK for me now. Thanks for fixing it.

MatthewVernon claimed this task.

Change 899885 merged by Tim Starling:

[operations/puppet@production] Unprovision the "swift" dashboard

https://gerrit.wikimedia.org/r/899885

ms-fe1009 and ms-fe2009 account for almost all errors, according to the swift_proxy_server_errors_total metric. They are also the only servers running Debian stretch and Swift 2.10; all the others have bullseye and Swift 2.26. I don't know whether they are actually broken or if the newer Swift just doesn't report its errors. I see there is T279637 tracking the upgrade.

Change 902064 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Provision the revised Swift dashboard

https://gerrit.wikimedia.org/r/902064

Change 902064 merged by MVernon:

[operations/puppet@production] Provision the revised Swift dashboard

https://gerrit.wikimedia.org/r/902064

Yann reopened this task as Open. (Edited Apr 4 2023, 10:11 PM)

Hi, Sorry, but it happened again while uploading https://commons.wikimedia.org/wiki/File:Dostoevsky_-_The_Gambler_and_Other_Stories,_Collected_Edition,_1917.pdf with chunked uploads.

01217: 27/38> in progress Upload: 100%
01271: FAILED: internal_api_error_UploadChunkFileException: [359de628-b095-48ac-af48-2db85dcebc18] Caught exception of type UploadChunkFileException

Chrome console:

index.php?action=raw&ctype=text/javascript&title=User:Rillke/MwJSBot.js&_=2:1136 mwbot> stashfailed: An unknown error occurred in storage backend "local-swift-eqiad". {error: {…}, servedby: 'mw2324'} success {readyState: 4, getResponseHeader: ƒ, getAllResponseHeaders: ƒ, setRequestHeader: ƒ, overrideMimeType: ƒ, …}

The root cause was never isolated or resolved, so we can expect the issue to recur periodically.

I would love to see the HTTP error response body. FileOperation logs show 502 errors, but the Swift proxy access logs do not. I tried to capture 502 error responses coming out of swift-proxy on ms-fe2*, but I didn't find any. I found some 500s, forwarded from Thumbor, but not 502. So maybe the 502s are generated by nginx. Access logging is disabled in nginx.

Mentioned in SAL (#wikimedia-operations) [2023-04-05T04:17:22Z] <TimStarling> restarted swift-proxy on ms-fe* T328872

Restarting the proxy servers temporarily fixed it again. The restart caused a doubling of the request rate on ms-fe1010, and CPU usage increased, indicating that that server was at fault. It reminds me of a similar problem we had years ago.

No sign of errors on the dashboard, except the 5xx rate reported by ATS.

Suggested action items:

  • Install a Prometheus collector for nginx
  • @aaron In SwiftFileBackend, could we log the response headers of an HTTP error response? Maybe some sort of truncated body if there is one?
  • Alert when the ATS->Swift 5xx rate increases above a certain value.
  • Daily restarts of swift-proxy?

It's likely that the issue will sporadically continue until it is fully isolated inside swift-proxy and properly fixed.

Headers and body could be logged for 5XXs easily enough in the Swift backend.

I also get many 5xx and stashfailed errors on my current uploads, using the Python requests library.

{'error': {'code': 'stashfailed', 'info': 'An unknown error occurred in storage backend "local-swift-codfw".', '*': 'See https://commons.wikimedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'}, 'servedby': 'mw1388'}

This also seems to be visible on the Grafana dashboard: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&viewPanel=37&from=now-2d&to=now-1m

A different message on a later upload, after many 5xx responses and some successful uploads:

{'error': {'code': 'uploadstash-exception', 'info': 'Could not store upload in the stash (UploadStashFileException): "Im Speicher-Backend „local-swift-codfw“ ist ein unbekannter Fehler aufgetreten.".', '*': 'See https://commons.wikimedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'}, 'servedby': 'mw1406'}

(The German text is the localized form of "An unknown error occurred in storage backend "local-swift-codfw".")
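
Since several people here drive the API from Python with requests, here is a hedged sketch of how a client could treat these responses as retryable. The helper name and retry limits are made up for illustration; the error codes are the ones reported in this task:

```python
import time

# Error codes seen in this task that have proven to be transient backend
# failures rather than client mistakes.
TRANSIENT_CODES = {
    "stashfailed",
    "uploadstash-exception",
    "internal_api_error_UploadChunkFileException",
    "backend-fail-internal",
}

def post_with_retries(session, url, max_attempts=8, **kwargs):
    """POST an API request, retrying responses that look transient.

    Retries on HTTP 5xx and on the error codes above, with capped
    exponential backoff; anything else is returned to the caller as-is.
    """
    data = None
    for attempt in range(max_attempts):
        resp = session.post(url, **kwargs)
        if resp.status_code < 500:
            data = resp.json()
            code = data.get("error", {}).get("code")
            if code not in TRANSIENT_CODES:
                return data  # success, or an error worth surfacing immediately
        # Transient failure (HTTP 5xx or a known-flaky error code): back off.
        time.sleep(min(2 ** attempt, 60))
    return data  # last transient response (or None if every attempt was a 5xx)
```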

I restarted the swift frontends this morning, and the 5xx rate has returned to normal.

Shouldn't we have an automatic alert on this many 5xxs?

We do monitor and graph the 5xx rate, but it was only ever a tiny fraction of the requests swift was serving (~0.05%).

Still happening. "The MediaWiki error backend-fail-internal occured: An unknown error occurred in storage backend "local-swift-eqiad"."

Change 936368 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] filebackend: Include truncated http body for 502 on SwiftFileBackend

https://gerrit.wikimedia.org/r/936368

Change 936368 merged by jenkins-bot:

[mediawiki/core@master] filebackend: Include truncated http body for 502 on SwiftFileBackend

https://gerrit.wikimedia.org/r/936368

There is a (small) spike in Grafana for 502 errors (and for 504 errors); see https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&viewPanel=37&from=1693494000000&to=1693501200000

Maybe new logs with the extra information added by that change have already been collected.

I think this is not linked to the error rate, but it makes the need for a fix more urgent: the Wiki Loves Monuments contests started today.

Yesterday, while uploading a small SVG file using bigChunkedUpload.js, the finalize stage took an unusually long time for such a small file and then logged 00019: FAILED: backend-fail-internal: An unknown error occurred in storage backend "local-swift-codfw". The file did seem to finish uploading as expected despite this, though.

Today, reuploading (overwriting) a small PNG, the finalize stage took a similarly unexpectedly long time, but no error message was logged. Afterwards, however, the thumbnails for the file semi-randomly fail to update: the main thumb on the file description page doesn't update; the mini-thumb in the upload history list does update; requesting the image thumb at any size using image markup from a client wiki (enWS) returns the old thumb, but squeezed into the new, narrower width. Various voodoo poking (action=purge etc.) has failed to unwedge it.

These are new after a relatively long period with no real problems uploading. Timing-wise they could roughly correspond with the spiking error rates @Umherirrender notes above, but as these were my only two uploads in the time period, that association is very weak at best.

I restarted the swift frontends, and that has returned the error rate to the usual low levels.

This is happening again, repeatedly, with not-so-big files (around 100 MB):

Chrome console:

mwbot> internal_api_error_UploadChunkFileException: [fab58ddb-22c9-4677-adfc-786416bf388a] Caught exception of type UploadChunkFileException  {"error":{"code":"internal_api_error_UploadChunkFileException","info":"[fab58ddb-22c9-4677-adfc-786416bf388a] Caught exception of type UploadChunkFileException","errorclass":"UploadChunkFileException"},"servedby":"mw2437"}

while uploading https://commons.wikimedia.org/wiki/File:Swiss_national_map,_39_Fl%C3%BCelapass.jpg

I've spent some time checking, and nothing that looks like that filename appears in the swift proxy logs today (and the error rates look as low as normal in general).

Now I get

00411: FAILED: stashfailed: An unknown error occurred in storage backend "local-swift-eqiad"

[I'm afraid my previous comment still applies - nothing relating to a file named like this appears in any of the swift frontend logs]

It seems some of these files are available in Special:UploadStash, but I get T348733 while trying to publish them.

Hi, just wanted to let you guys know I'm getting this exact same error trying to upload a new revision to https://commons.wikimedia.org/wiki/File:The_Sea_Beast_(1926).webm, file size ~385 MB.

Specific error: 00172: 44/84> FAILED: internal_api_error_UploadChunkFileException: [c7e1fd5a-e9e6-4d21-8a84-e0cb0d31d9f7] Caught exception of type UploadChunkFileException

On https://commons.wikimedia.org/wiki/File:Swiss_national_map,_42_Oberwallis.jpg I still get for every try

00293: FAILED: internal_api_error_UploadChunkFileException: [c995f30e-54b1-490d-b829-ec4c4924acc0] Caught exception of type UploadChunkFileException

On https://commons.wikimedia.org/wiki/File:Swiss_national_map,_42_Oberwallis.jpg I still get for every try

00293: FAILED: internal_api_error_UploadChunkFileException: [c995f30e-54b1-490d-b829-ec4c4924acc0] Caught exception of type UploadChunkFileException

Oct 13, 2023 @ 11:19:46.595 ERROR exception mw2295 commonswiki [c995f30e-54b1-490d-b829-ec4c4924acc0] /w/api.php UploadChunkFileException: Error storing file in '/tmp/phpUeTKDd': backend-fail-internal; local-swift-eqiad
Oct 13, 2023 @ 11:19:46.594 ERROR FileOperation mw2295 commonswiki HTTP 401 (Unauthorized) in 'SwiftFileBackend::doStoreInternal' (given '{"async":false,"src":"/tmp/phpUeTKDd","dst":"mwstore://local-swift-eqiad/local-temp/9/90/1af4kf8yow0c.in02i2.1.jpg.13","overwrite":true,"headers":[]}')
Oct 13, 2023 @ 11:19:46.594 ERROR FileOperation mw2295 commonswiki StoreFileOp failed: {"src":"/tmp/phpUeTKDd","dst":"mwstore://local-swift-eqiad/local-temp/9/90/1af4kf8yow0c.in02i2.1.jpg.13","overwrite":true,"headers":[],"failedAction":"attempt"}

So in this case a 401. That's pretty interesting.

On https://commons.wikimedia.org/wiki/File:Swiss_national_map,_42_Oberwallis.jpg I still get for every try

00293: FAILED: internal_api_error_UploadChunkFileException: [c995f30e-54b1-490d-b829-ec4c4924acc0] Caught exception of type UploadChunkFileException

Oct 13, 2023 @ 11:19:46.595 ERROR exception mw2295 commonswiki [c995f30e-54b1-490d-b829-ec4c4924acc0] /w/api.php UploadChunkFileException: Error storing file in '/tmp/phpUeTKDd': backend-fail-internal; local-swift-eqiad
Oct 13, 2023 @ 11:19:46.594 ERROR FileOperation mw2295 commonswiki HTTP 401 (Unauthorized) in 'SwiftFileBackend::doStoreInternal' (given '{"async":false,"src":"/tmp/phpUeTKDd","dst":"mwstore://local-swift-eqiad/local-temp/9/90/1af4kf8yow0c.in02i2.1.jpg.13","overwrite":true,"headers":[]}')
Oct 13, 2023 @ 11:19:46.594 ERROR FileOperation mw2295 commonswiki StoreFileOp failed: {"src":"/tmp/phpUeTKDd","dst":"mwstore://local-swift-eqiad/local-temp/9/90/1af4kf8yow0c.in02i2.1.jpg.13","overwrite":true,"headers":[],"failedAction":"attempt"}

So in this case a 401. That's pretty interesting.

As SwiftFileBackend::doStoreInternal is mentioned in the error, this case also seems to be covered by T206252: Spike of HTTP errors from SwiftFileBackend::doStoreInternal

For the 401 there is information under T228292#7490101 (that task's description shows the same stack trace as this task's description).

I'm doing a bulk upload from Lightroom this evening and am running into "The MediaWiki error backend-fail-internal occured: An unknown error occurred in storage backend "local-swift-eqiad"." again.

While I managed to upload the files mentioned on Wed, Oct 11, 7:28 PM after many attempts, I still get this error, now with the following messages, respectively:

00156: FAILED: internal_api_error_UploadChunkFileException: [0fa38f19-d4b2-49d4-b858-134dc09a519a] Caught exception of type UploadChunkFileException

00213: FAILED: internal_api_error_UploadChunkFileException: [500587b3-7b15-4621-b8a1-8416b1110235] Caught exception of type UploadChunkFileException

00318: FAILED: stashfailed: An unknown error occurred in storage backend "local-swift-eqiad".

Since the cause is not the same old failure of the proxy servers, maybe it would improve clarity if we discussed this incident on a separate task.

What I'm seeing is a total cessation of successful delete operations and a flood of MW log messages showing 401 status codes for various actions.

Since the cause is not the same old failure of the proxy servers, maybe it would improve clarity if we discussed this incident on a separate task.

What I'm seeing is a total cessation of successful delete operations and a flood of MW log messages showing 401 status codes for various actions.

I don't understand what you mean. I can't say if this is related to T349127, but I got this error and T341007 in quick succession while uploading files. Now I managed to upload the files mentioned above after ~10 failed attempts. I can't say if this issue is fixed, or if I just got lucky.

I wonder if the auth token just expired while the combined file was being uploaded (but after the PUT started), so that by the time the chunk deletions were issued the token had expired. AFAIK the tokens should last for a day, and MediaWiki app servers only reuse them for ~15 minutes (which used to be higher, but they seemed to expire too soon). It would be interesting to see the memcached stats for the token storage on the swift servers (maybe there is eviction pressure for some reason, like some service never reusing tokens and thus flooding memcached).

You can see metrics about swift's memcached usage in Grafana; I've previously looked to see if incidents (i.e. raised 5xx levels) correlated with events in memcached, but haven't found anything. There is one very hot memcached key (I think it's account_info, cf. headers_to_account_info in the swift source code); this is probably not ideal, but I'm not sure if it's a swift bug or just the way the system is expected to work (and I've no evidence to suggest that it is causing a problem).
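
One quick way to check the eviction-pressure theory would be to read the raw memcached counters on the swift frontends; a minimal sketch using pymemcache, where the host list is a placeholder and the choice of interesting stats is an assumption:

```python
from pymemcache.client.base import Client

# Placeholder hosts; the actual swift memcached instances are not
# enumerated in this task.
HOSTS = ["ms-fe-memcached.example.org"]
INTERESTING = ("curr_items", "evictions", "get_hits", "get_misses", "expired_unfetched")

for host in HOSTS:
    # The STATS command returns byte-string keys, so decode them for lookup.
    stats = Client((host, 11211), connect_timeout=2, timeout=2).stats()
    decoded = {k.decode(): v for k, v in stats.items()}
    print(host)
    for key in INTERESTING:
        print(f"  {key}: {decoded.get(key)}")
```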

index.php?action=raw&ctype=text/javascript&title=User:Rillke/MwJSBot.js&_=2:1136 mwbot> stashfailed: An unknown error occurred in storage backend "local-swift-codfw".

while uploading https://commons.wikimedia.org/wiki/File:Kaibab_trail_at_Grand_Canfon_National_Park.jpg