backend-fail-internal error while deleting files
Open, Stalled, High, Public

Description

error code: backend-fail-internal
error info: An unknown error occurred in storage backend "local-swift-eqiad"

Reported at https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard#Serious_deletion_error_issue


Version: wmf-deployment
Severity: major
Whiteboard: aklapper-moreinfo

Details

Reference
bz69760
bzimport added a subscriber: Unknown Object (MLST).
bzimport set Reference to bz69760.
Rillke created this task. Aug 19 2014, 8:54 PM

cagedbirdsinging wrote:

This bug is causing about 1/3 of my attempts to delete files to fail. I then have to refresh my browser before I can finally get the files to delete, especially in mass DRs or nukes.

INeverCry

The same error occurs during file deletions on the German Wikipedia, see [0]. Error message:

Fehler bei Datei-Löschung: Im Speicher-Backend „local-swift-eqiad“ ist ein unbekannter Fehler aufgetreten.
(English: Error deleting file: An unknown error occurred in storage backend "local-swift-eqiad".)

[0] https://de.wikipedia.org/wiki/Wikipedia:Administratoren/Anfragen#Probleme_beim_L.C3.B6schen_von_Dateien

This is actually an urgent issue; it also affects uploads, where images or file description pages get corrupted.
Is nobody on the tech team alerted by (hopefully existing) automatic error messages?

There are unresolved high-priority bugs in the "Media storage" component. Swift is vital to the projects' ability to show images and other media, and having so many open bugs against it causes serious ongoing issues, not only on Commons but everywhere.

wikipedia wrote:

See Screenshot in German Wikipedia: https://de.wikipedia.org/wiki/Datei:Screenshot_Fehler_im_Speicher-Backend.png

This is an urgent issue.

*** Bug 69717 has been marked as a duplicate of this bug. ***

<godog> it is running a bit hot on bandwidth from/to the upload caches but shouldn't be too bad, not sure exactly what mw does when talking to swift
<godog> all that load comes artificially from ms-be1003 having xfs in a funny state
!log reboot ms-be1003, xfs errors/panics

that (rebooting ms-be1003) did it, the proxy mentioned ERRORS and timeouts towards ms-be1003 while attempting to DELETE, which would explain the symptoms.

can you try again and see if it works? thanks!

cagedbirdsinging wrote:

Still getting a bunch of these same errors as I try deletions here on Commons:

API request failed (backend-fail-internal): An unknown error occurred in storage backend "local-swift-eqiad". At Wed, 20 Aug 2014 17:42:39 GMT, served by mw1119

Observed the same at Commons, no improvement seen.

API request failed (backend-fail-internal): An unknown error occurred in storage backend "local-swift-eqiad". At Wed, 20 Aug 2014 21:26:31 GMT, served by mw1132

cagedbirdsinging wrote:

I just deleted 200+ files from Commons with no errors.

Not sure if this is related, but uploads have been failing with a similar message:

{"error":{"0":["backend-fail-internal","local-swift-eqiad"],"code":"internal-error","info":"An internal error occurred"},"servedby":"mw1202"}
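For anyone collecting these reports from scripts or gadgets, here is a minimal Python sketch of pulling the backend failure out of a response like the one above. The helper name `backend_failure` is hypothetical, and the assumption that numeric keys under "error" hold a [message key, backend name] list is inferred only from this single sample; it is not taken from MediaWiki documentation.

```python
import json

# Response body copied verbatim from the report above.
RAW = '{"error":{"0":["backend-fail-internal","local-swift-eqiad"],"code":"internal-error","info":"An internal error occurred"},"servedby":"mw1202"}'


def backend_failure(raw):
    """Return (message_key, backend, served_by) if the response carries a
    backend-fail-* entry, else None.

    Hypothetical helper: the response layout is assumed from one sample.
    """
    data = json.loads(raw)
    error = data.get("error", {})
    for key, value in error.items():
        # Assumption: numeric keys ("0", "1", ...) hold [message key, parameter] lists.
        if key.isdigit() and isinstance(value, list) and value:
            if str(value[0]).startswith("backend-fail-"):
                backend = value[1] if len(value) > 1 else None
                return value[0], backend, data.get("servedby")
    return None


print(backend_failure(RAW))
```

Logging the returned tuple together with a timestamp would give the kind of data point asked for elsewhere in this thread.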

The problem in comment 9 is clearly visible in ganglia. Don't see any obvious more recent issues on the same ganglia graphs.

https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Swift+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 (may need to adjust time period at the top depending on when you click the link)

(In reply to Fastily from comment #15)

> Not sure if this is related, but uploads have been failing with a similar
> message:

btw, please provide timestamps for when the errors happened if you have them! (e.g. comments 13/15)

same here, I can't see any obvious issues with swift after rebooting the machine that was causing the high load yesterday.

we are doing some tuning to the nagios alerts we get for swift to detect recurrence (and a root cause/fix too!)

pierre-selim.huard wrote:

2014-08-21T13:57Z the bug strikes back!

API request failed (backend-fail-delete): Could not delete file "mwstore://local-swift-eqiad/local-public/c/ce/Крушение_поезда_в_московском_метро_15.07.2014.jpg"

Nick added a comment. Aug 21 2014, 2:05 PM

(In reply to Pierre-Selim from comment #19)

> 2014-08-21T13:57Z the bug strikes back!
>
> API request failed (backend-fail-delete): Could not delete file
> "mwstore://local-swift-eqiad/local-public/c/ce/Крушение_поезда_в_московском_метро_15.07.2014.jpg"

Same file, slightly different error message.

Error deleting file: Could not delete file "mwstore://local-swift-eqiad/local-public/c/ce/Крушение_поезда_в_московском_метро_15.07.2014.jpg".

further errors turned up with swift talking to memcached. I've pushed https://gerrit.wikimedia.org/r/#/c/155629/ to bump that limit; the timeouts are now greatly reduced, though not completely eliminated yet, so the impact should be a lot smaller

*** Bug 69875 has been marked as a duplicate of this bug. ***

(In reply to jeremyb from comment #17)

> (In reply to Fastily from comment #15)
> > Not sure if this is related, but uploads have been failing with a similar
> > message:
>
> btw, please provide timestamps for when the errors happened if you have
> them! (e.g. comments 13/15)

Unfortunately I don't have an exact timestamp, but I do know this was happening during the same time deletions were failing. I haven't tried uploading anything since. Will definitely try again sometime this weekend.

So I've done quite a number of uploads and deletions since I last posted here, and have not experienced a 'backend-fail-internal' error since. I'm going to go ahead and close this as resolved for now. If anyone else is still experiencing errors, please don't hesitate to reopen! :)

Issue reappeared on [[commons:File:Pheliperodrigues.jpg]]

Error deleting file: Could not delete file "mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg".

misc data points: I'm seeing some attempts in filebackend-ops.log:

2014-09-01 13:42:52 mw1210 commonswiki: MoveFileOp failed (batch #750loigffakv97vzttctb06d3xb1nf6): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":false,"failedAction":"attempt"}
2014-09-01 13:43:20 mw1198 commonswiki: MoveFileOp failed (batch #750loighcfplahx48bnr125t45twh4z): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:45:49 mw1104 commonswiki: MoveFileOp failed (batch #750loignpnz38ysz6rjotgwq7h5i1os): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:45:50 mw1119 commonswiki: MoveFileOp failed (batch #750loignq7eycehvi3b13ykg1ji76su): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:46:10 mw1187 commonswiki: MoveFileOp failed (batch #750loigpdo6255p9lg4z3w3826strm2): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:47:00 mw1150 commonswiki: MoveFileOp failed (batch #750loigrwlc8cwz29zgmbns0fe5df9l): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:48:07 mw1175 commonswiki: MoveFileOp failed (batch #750loiguvhuma8fk8395oakjvgzq66o): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:52:58 mw1183 commonswiki: MoveFileOp failed (batch #750loih7eo5eju3z0i7mm7gdued1lxp): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:53:04 mw1073 commonswiki: MoveFileOp failed (batch #750loih8ndhfheqxmjb5qpvokp2p452): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
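To make excerpts like the one above easier to compare, here is a rough Python sketch that splits each line into timestamp, host, wiki, batch id and the JSON parameters. The regular expression and the `parse_line` helper are assumptions derived only from the lines pasted here; other entries in filebackend-ops.log may not match.

```python
import json
import re

# Two lines copied verbatim from the filebackend-ops.log excerpt above.
LINES = [
    '2014-09-01 13:42:52 mw1210 commonswiki: MoveFileOp failed (batch #750loigffakv97vzttctb06d3xb1nf6): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":false,"failedAction":"attempt"}',
    '2014-09-01 13:43:20 mw1198 commonswiki: MoveFileOp failed (batch #750loighcfplahx48bnr125t45twh4z): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}',
]

# Pattern inferred from the excerpt only (assumption, not a documented format).
LINE_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<wiki>\S+): "
    r"(?P<op>\w+) failed \(batch #(?P<batch>\w+)\): "
    r"(?P<params>\{.*\})$"
)


def parse_line(line):
    """Return a dict for one failure line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["params"] = json.loads(entry["params"])  # decode the JSON payload
    return entry


for line in LINES:
    entry = parse_line(line)
    print(entry["time"], entry["host"], "dstExists =", entry["params"]["dstExists"])
```

Run over the full excerpt, this shows the first attempt (13:42:52 on mw1210) reporting dstExists as false and every later attempt reporting it as true.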

and the hashed file seems to be already there:

swift list wikipedia-commons-local-deleted.q5 | grep q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg

q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg
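For repeating that check on other files, a small Python sketch of how the container and object name used in the `swift list` command above relate to the deleted-file key. This is a guess from this one example only (first two characters as the container suffix, first three characters as directory levels); the function name and the default container prefix are illustrative, and the real sharding is set in the backend configuration.

```python
def deleted_object_location(key, container="wikipedia-commons-local-deleted"):
    """Guess (container, object name) for a deleted-file key.

    Assumption from the single example in this thread: the container is
    suffixed with the first two characters of the key, and the object
    name nests the key under its first three characters.
    """
    if len(key) < 3:
        raise ValueError("key too short to derive a path")
    shard = key[:2]
    object_name = "/".join([key[0], key[1], key[2], key])
    return f"{container}.{shard}", object_name


container, object_name = deleted_object_location("q5qea4gleglvwbotppyd5fq5jnye5zz.jpg")
print(f"swift list {container} | grep {object_name}")
```

For the key in this thread the printed command matches the one run above.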

though no match for that file in swift-backend.log:

$ zgrep -i Pheliperodrigues.jpg swift-backend.log archive/swift-backend.log-20140901.gz archive/swift-backend.log-201408*
$

seemingly a different (but related?) issue

pierre-selim.huard wrote:

Looks like INeverCry finally succeeded in deleting that file.

Is the problem described in comment 25 to comment 27 still seen?


Aklapper changed the task status from "Open" to "Stalled". Nov 25 2014, 7:47 PM
Fastily removed a subscriber: Fastily. May 21 2015, 4:08 AM
Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 6:46 PM
Restricted Application added a subscriber: Matanya. Sep 4 2015, 6:46 PM
