backend-fail-internal error while deleting files
Open, Stalled, Public

Description

error code: backend-fail-internal
error info: An unknown error occurred in storage backend "local-swift-eqiad"

Reported at https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard#Serious_deletion_error_issue


Version: wmf-deployment
Severity: major
Whiteboard: aklapper-moreinfo

bzimport added a project: Wikimedia-Media-storage.Via ConduitNov 22 2014, 3:44 AM
bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz69760.
Rillke created this task.Via LegacyAug 19 2014, 8:54 PM
bzimport added a comment.Via ConduitAug 20 2014, 2:19 AM

cagedbirdsinging wrote:

This bug is causing about 1/3 of my attempts to delete files to fail. I then have to refresh my browser before I can finally get the files to delete, especially in mass DRs or nukes.

INeverCry
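The manual workaround described above (retrying until the deletion goes through) could be scripted roughly as follows. This is only a sketch: `BackendError`, `delete_with_retry`, and the `delete_file` callable are hypothetical names, not part of MediaWiki or any client library, and treating the two error codes seen in this task as transient is an assumption.

```python
import time


class BackendError(Exception):
    """Hypothetical exception carrying an API error code."""

    def __init__(self, code, info=""):
        super().__init__(f"{code}: {info}")
        self.code = code


# Error codes reported in this task; treating them as retryable is an assumption.
RETRYABLE = {"backend-fail-internal", "backend-fail-delete"}


def delete_with_retry(delete_file, title, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call delete_file(title), backing off exponentially on retryable errors."""
    for attempt in range(attempts):
        try:
            return delete_file(title)
        except BackendError as err:
            is_last = attempt == attempts - 1
            if err.code not in RETRYABLE or is_last:
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Retrying only papers over the symptom; it does not address whatever is failing in the storage backend.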

Ireas added a comment.Via ConduitAug 20 2014, 10:17 AM

The same error occurs during file deletions on the German Wikipedia, see [0]. Error message (translated from German):

Error deleting file: An unknown error occurred in storage backend "local-swift-eqiad".

[0] https://de.wikipedia.org/wiki/Wikipedia:Administratoren/Anfragen#Probleme_beim_L.C3.B6schen_von_Dateien

Denniss added a comment.Via ConduitAug 20 2014, 10:33 AM

This is actually an urgent issue; it also affects uploads, where images or file description pages get corrupted.
Is nobody on the tech team alerted by (hopefully existing) automatic error messages?

Steinsplitter added a comment.Via ConduitAug 20 2014, 10:37 AM

There are unresolved prio bugs in the "Media storage" component. Swift is a vital component of the projects' ability to show images and other media, and it having so many open bugs causes serious ongoing issues, not only on Commons, but everywhere.

bzimport added a comment.Via ConduitAug 20 2014, 10:47 AM

wikipedia wrote:

See the screenshot on German Wikipedia: https://de.wikipedia.org/wiki/Datei:Screenshot_Fehler_im_Speicher-Backend.png

This is an urgent issue.

Ciencia_Al_Poder added a comment.Via ConduitAug 20 2014, 11:25 AM
*** Bug 69717 has been marked as a duplicate of this bug. ***
Aklapper added a comment.Via ConduitAug 20 2014, 1:50 PM

<godog> it is running a bit hot on bandwidth from/to the upload caches but shouldn't be too bad, not sure exactly what mw does when talking to swift
<godog> all that load comes artificially from ms-be1003 having xfs in a funny state
!log reboot ms-be1003, xfs errors/panics

fgiunchedi added a comment.Via ConduitAug 20 2014, 2:24 PM

that (rebooting ms-be1003) did it: the proxy mentioned ERRORS and timeouts towards ms-be1003 while attempting to DELETE, which would explain the symptoms.

can you try again and see if it works? thanks!

bzimport added a comment.Via ConduitAug 20 2014, 5:45 PM

cagedbirdsinging wrote:

Still getting a bunch of these same errors as I try deletions here on Commons:

API request failed (backend-fail-internal): An unknown error occurred in storage backend "local-swift-eqiad". at Wed, 20 Aug 2014 17:42:39 GMT served by mw1119

Denniss added a comment.Via ConduitAug 20 2014, 8:45 PM

Observed the same at Commons, no improvement seen.

Steinsplitter added a comment.Via ConduitAug 20 2014, 9:27 PM

API request failed (backend-fail-internal): An unknown error occurred in storage backend "local-swift-eqiad". at Wed, 20 Aug 2014 21:26:31 GMT served by mw1132

bzimport added a comment.Via ConduitAug 21 2014, 3:34 AM

cagedbirdsinging wrote:

I just deleted 200+ files from Commons with no errors.

Fastily added a comment.Via ConduitAug 21 2014, 6:23 AM

Not sure if this is related, but uploads have been failing with a similar message:

{"error":{"0":["backend-fail-internal","local-swift-eqiad"],"code":"internal-error","info":"An internal error occurred"},"servedby":"mw1202"}
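For anyone collecting these reports, a minimal sketch of pulling the useful fields out of an error body like the one above. The function name `summarize_api_error` is hypothetical, and the assumption that the `"0"` entry always holds a `[code, backend]` pair is based only on this single sample.

```python
import json

# Error body copied verbatim from the comment above.
RAW = (
    '{"error":{"0":["backend-fail-internal","local-swift-eqiad"],'
    '"code":"internal-error","info":"An internal error occurred"},'
    '"servedby":"mw1202"}'
)


def summarize_api_error(raw):
    """Extract the outer code, backend code, backend name and app server."""
    body = json.loads(raw)
    error = body.get("error", {})
    # Assumed shape: error["0"] == [backend_code, backend_name]; pad if absent.
    detail = list(error.get("0") or []) + [None, None]
    return {
        "code": error.get("code"),
        "backend_code": detail[0],
        "backend": detail[1],
        "servedby": body.get("servedby"),
    }


print(summarize_api_error(RAW))
```

Note that the outer code here is `internal-error`, with `backend-fail-internal` only appearing in the nested entry, so matching on the outer code alone would miss the link to this task.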

jeremyb added a comment.Via ConduitAug 21 2014, 6:34 AM

The problem in comment 9 is clearly visible in ganglia. Don't see any obvious more recent issues on the same ganglia graphs.

https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Swift+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 (may need to adjust time period at the top depending on when you click the link)

jeremyb added a comment.Via ConduitAug 21 2014, 6:38 AM

(In reply to Fastily from comment #15)

> Not sure if this is related, but uploads have been failing with a similar
> message:

btw, please provide timestamps for when the errors happened if you have them! (e.g. comments 13/15)

fgiunchedi added a comment.Via ConduitAug 21 2014, 1:12 PM

same here, I can't see any obvious issues with swift after rebooting the machine that was causing the high load yesterday.

we are doing some tuning to the nagios alerts we get for swift to detect recurrence (and a root cause/fix too!)

bzimport added a comment.Via ConduitAug 21 2014, 1:59 PM

pierre-selim.huard wrote:

2014-08-21T13:57Z the bug strikes back!

API request failed (backend-fail-delete): Could not delete file "mwstore://local-swift-eqiad/local-public/c/ce/Крушение_поезда_в_московском_метро_15.07.2014.jpg"

Nick added a comment.Via ConduitAug 21 2014, 2:05 PM

(In reply to Pierre-Selim from comment #19)

> 2014-08-21T13:57Z the bug strikes back!
>
> API request failed (backend-fail-delete): Could not delete file
> "mwstore://local-swift-eqiad/local-public/c/ce/Крушение_поезда_в_московском_метро_15.07.2014.jpg"

Same file, slightly different error message.

Error deleting file: Could not delete file "mwstore://local-swift-eqiad/local-public/c/ce/Крушение_поезда_в_московском_метро_15.07.2014.jpg".
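The storage paths in these messages appear to follow the pattern mwstore://<backend>/<container>/<relative path>. A small sketch for splitting them, useful when grouping reports by backend or container; `StoragePath` and `parse_storage_path` are illustrative names, and the three-part layout is read off the examples in this task rather than taken from a specification.

```python
from typing import NamedTuple


class StoragePath(NamedTuple):
    backend: str
    container: str
    rel_path: str


def parse_storage_path(path):
    """Split 'mwstore://backend/container/rel/path' into its three parts."""
    prefix = "mwstore://"
    if not path.startswith(prefix):
        raise ValueError(f"not a storage path: {path!r}")
    parts = path[len(prefix):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"incomplete storage path: {path!r}")
    return StoragePath(*parts)


example = parse_storage_path(
    "mwstore://local-swift-eqiad/local-public/c/ce/"
    "Крушение_поезда_в_московском_метро_15.07.2014.jpg"
)
print(example.backend, example.container, example.rel_path)
```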

fgiunchedi added a comment.Via ConduitAug 21 2014, 8:54 PM

there were further errors found with swift talking to memcached. I've pushed https://gerrit.wikimedia.org/r/#/c/155629/ to bump that limit; the timeouts are now greatly reduced, though not yet completely eliminated, so the impact should be a lot less

Aklapper added a comment.Via ConduitAug 21 2014, 9:45 PM
*** Bug 69875 has been marked as a duplicate of this bug. ***
Fastily added a comment.Via ConduitAug 22 2014, 6:33 AM

(In reply to jeremyb from comment #17)

> (In reply to Fastily from comment #15)
> > Not sure if this is related, but uploads have been failing with a similar
> > message:
>
> btw, please provide timestamps for when the errors happened if you have
> them! (e.g. comments 13/15)

Unfortunately I don't have an exact timestamp, but I do know this was happening during the same time deletions were failing. I haven't tried uploading anything since. Will definitely try again sometime this weekend.

Fastily added a comment.Via ConduitAug 23 2014, 10:40 PM

So I've done quite a number of uploads and deletions since I last posted here, and have not experienced a 'backend-fail-internal' error since. I'm going to go ahead and close this as resolved for now. If anyone else is still experiencing errors, please don't hesitate to reopen! :)

Dereckson added a comment.Via ConduitSep 1 2014, 2:45 PM

Issue reappeared on [[commons:File:Pheliperodrigues.jpg]]

Error deleting file: Could not delete file "mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg".

fgiunchedi added a comment.Via ConduitSep 2 2014, 7:23 AM

misc data points: I'm seeing some attempts in filebackend-ops.log:

2014-09-01 13:42:52 mw1210 commonswiki: MoveFileOp failed (batch #750loigffakv97vzttctb06d3xb1nf6): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":false,"failedAction":"attempt"}
2014-09-01 13:43:20 mw1198 commonswiki: MoveFileOp failed (batch #750loighcfplahx48bnr125t45twh4z): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:45:49 mw1104 commonswiki: MoveFileOp failed (batch #750loignpnz38ysz6rjotgwq7h5i1os): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:45:50 mw1119 commonswiki: MoveFileOp failed (batch #750loignq7eycehvi3b13ykg1ji76su): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:46:10 mw1187 commonswiki: MoveFileOp failed (batch #750loigpdo6255p9lg4z3w3826strm2): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:47:00 mw1150 commonswiki: MoveFileOp failed (batch #750loigrwlc8cwz29zgmbns0fe5df9l): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:48:07 mw1175 commonswiki: MoveFileOp failed (batch #750loiguvhuma8fk8395oakjvgzq66o): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:52:58 mw1183 commonswiki: MoveFileOp failed (batch #750loih7eo5eju3z0i7mm7gdued1lxp): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}
2014-09-01 13:53:04 mw1073 commonswiki: MoveFileOp failed (batch #750loih8ndhfheqxmjb5qpvokp2p452): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}

and the hashed file seems to be already there:

swift list wikipedia-commons-local-deleted.q5 | grep q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg

q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg
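A rough sketch for summarizing entries like the ones above. The regular expression is inferred from these sample lines only and is not an official description of the filebackend-ops.log format; `parse_failed_op` is a hypothetical helper name. In the two sample lines, `dstExists` goes from false on the first attempt to true on the retry, which is consistent with the hashed file already being present in the deleted container.

```python
import json
import re

# Two entries copied verbatim from the log excerpt above.
SAMPLE = [
    '2014-09-01 13:42:52 mw1210 commonswiki: MoveFileOp failed (batch #750loigffakv97vzttctb06d3xb1nf6): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":false,"failedAction":"attempt"}',
    '2014-09-01 13:43:20 mw1198 commonswiki: MoveFileOp failed (batch #750loighcfplahx48bnr125t45twh4z): {"src":"mwstore://local-swift-eqiad/local-public/9/97/Pheliperodrigues.jpg","dst":"mwstore://local-swift-eqiad/local-deleted/q/5/q/q5qea4gleglvwbotppyd5fq5jnye5zz.jpg","overwriteSame":true,"dstExists":true,"failedAction":"attempt"}',
]

# Assumed layout: "<date> <time> <host> <wiki>: <Op> failed (batch #<id>): <json>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<wiki>\S+): "
    r"(?P<op>\w+) failed \(batch #(?P<batch>\w+)\): (?P<params>\{.*\})$"
)


def parse_failed_op(line):
    """Return a dict for one failed-operation line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["params"] = json.loads(record["params"])
    return record


for rec in filter(None, map(parse_failed_op, SAMPLE)):
    print(rec["ts"], rec["host"], rec["op"], "dstExists =", rec["params"]["dstExists"])
```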

fgiunchedi added a comment.Via ConduitSep 2 2014, 7:31 AM

though no match for that file in swift-backend.log:

$ zgrep -i Pheliperodrigues.jpg swift-backend.log archive/swift-backend.log-20140901.gz archive/swift-backend.log-201408*
$

seemingly a different (but related?) issue
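The search above can be approximated in Python for anyone without zgrep at hand. This is a sketch under stated assumptions: `grep_logs` is a hypothetical helper, and it decides whether a file is compressed by its `.gz` suffix, which is a simplification compared with zgrep.

```python
import gzip
from pathlib import Path


def grep_logs(paths, needle):
    """Case-insensitively search plain and .gz log files; return (path, lineno, line)."""
    needle = needle.lower()
    hits = []
    for path in map(Path, paths):
        # Simplification: pick the opener from the file suffix.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, 1):
                if needle in line.lower():
                    hits.append((str(path), lineno, line.rstrip("\n")))
    return hits
```

An empty result, as in the comment above, only shows that the name does not appear in the files that were searched.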

bzimport added a comment.Via ConduitSep 2 2014, 7:44 AM

pierre-selim.huard wrote:

Looks like INeverCry finally succeeded in deleting that file.

Aklapper added a comment.Via ConduitSep 22 2014, 10:49 AM

Is the problem described in comment 25 to comment 27 still seen?

Aklapper added a comment.Via ConduitOct 10 2014, 4:32 PM

Is the problem described in comment 25 to comment 27 still seen?

Aklapper changed the task status from "Open" to "Stalled".Via WebNov 25 2014, 7:47 PM
Gilles added a project: Multimedia.Via WebDec 4 2014, 9:22 AM
Fastily removed a subscriber: Fastily.Via WebThu, May 21, 4:08 AM
