
Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends")
Closed, Resolved (Public)

Description

We are currently unable to delete, move or upload new versions of multiple files on se.wikimedia.org.

Trying to do so (e.g. with this one) results in the error message:

Error deleting file: The file "mwstore://local-multiwrite/local-public/9/9c/Etapprapporteringexempel_v3.0.odt" is in an inconsistent state within the internal storage backends
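
For readers unfamiliar with the path in that message: the "9/9c/" part is the hashed upload directory. A minimal sketch of how such a path is put together, assuming the usual MediaWiki scheme of two hash levels taken from the MD5 of the file name with spaces replaced by underscores. The function names here are hypothetical and this is not MediaWiki's own code (which is PHP).

```python
import hashlib


def hashed_rel_path(file_name: str, levels: int = 2) -> str:
    """Return the 'x/xy/Name' relative path for a file name.

    Assumption: the directory levels are the leading hex digits of the
    MD5 of the name, with spaces replaced by underscores.
    """
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Level 1 is the first hex digit, level 2 the first two, and so on.
    parts = [digest[: i + 1] for i in range(levels)]
    return "/".join(parts + [name])


def storage_path(backend: str, container: str, file_name: str) -> str:
    """Build an mwstore:// style path like the one in the error message."""
    return f"mwstore://{backend}/{container}/{hashed_rel_path(file_name)}"


if __name__ == "__main__":
    print(storage_path("local-multiwrite", "local-public",
                       "Etapprapporteringexempel_v3.0.odt"))
```

The path is only a name inside the multiwrite backend; the error is about the copies of that object differing between the underlying stores, not about the path itself.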

Event Timeline

Vogone triaged this task as Unbreak Now! priority. Feb 25 2016, 7:02 PM
Vogone subscribed.

Also happens on other wikis.

The problem happens on Azerbaijani Wikipedia too. Example:

Capture.PNG (648×1 px, 90 KB)

Thanks for reporting this.
CC'ing @faidon and @aaron.
Wondering if this is DBA (probably not)? Reminds me of old T41221 / T49905.

Aklapper renamed this task from Unable to delete, move or upload new versions of files on se.wikimedia.org (inconsistent state within the internal storage backends) to Unable to delete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends"). Feb 26 2016, 11:19 AM

@Serkanland @Lokal_Profil does it happen for any file operation on any file at the moment?

This only happens when I try to delete a file. The problem doesn't happen when I upload a new version of a file.

@Serkanland thanks, could you try deleting Etapprapporteringexempel_v3.0.odt again now?

@fgiunchedi, I deleted one file, but when I tried to delete another file, this happened:

Capture.PNG (644×1 px, 94 KB)

Finally I deleted this file :-) Thank you very much for the help!

Mentioned in SAL [2016-02-26T13:26:58Z] <godog> launch swiftrepl continuous replication for unsharded containers on ms-fe1003 T128096

fgiunchedi lowered the priority of this task from Unbreak Now! to Medium. Feb 26 2016, 2:14 PM

I've launched a continuous replication via swiftrepl for unsharded containers, which should keep codfw and eqiad in the same state and thus allow files to be deleted.
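
To illustrate what "in the same state" means here, a minimal sketch that compares two container listings and reports objects that would make a consistency check fail. This is not swiftrepl's actual algorithm; the listing format (object name mapped to a content hash) and the function name are assumptions made for the example.

```python
def find_inconsistent(primary: dict, secondary: dict) -> dict:
    """Map object name -> reason for every object that differs.

    Both arguments are hypothetical listings of the form
    {object_name: content_hash} for the same container in two clusters.
    """
    problems = {}
    for name in primary.keys() | secondary.keys():
        if name not in secondary:
            problems[name] = "missing in secondary"
        elif name not in primary:
            problems[name] = "missing in primary"
        elif primary[name] != secondary[name]:
            problems[name] = "content differs"
    return problems


if __name__ == "__main__":
    # Made-up listings, standing in for eqiad and codfw.
    eqiad = {"9/9c/a.odt": "h1", "d/de/b.png": "h2"}
    codfw = {"9/9c/a.odt": "h1"}
    for name, reason in sorted(find_inconsistent(eqiad, codfw).items()):
        print(name, reason)
```

A replication pass would then copy or remove the reported objects until the function returns an empty result for each container.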

Other users, please confirm whether you can delete files successfully now.

Yes, I can. I confirm this for Azerbaijani Wikipedia.

I can confirm that files can now be moved and new versions be uploaded on se.wikimedia.org

@fgiunchedi Filippo, I've got one now on simplewiki:

https://simple.wikipedia.org/wiki/File:Test-image-for-Yottie.png

Error deleting file: The file "mwstore://local-multiwrite/local-public/d/de/Test-image-for-Yottie.png" is in an inconsistent state within the internal storage backends

Probably caused by T128124. swiftrepl takes time to clean these up.

Krenair renamed this task from Unable to delete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") to Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends"). Mar 2 2016, 11:57 PM
Krenair added a subscriber: Marshallsumter.

Mentioned in SAL [2016-03-03T10:54:49Z] <godog> replicate swift unsharded -deleted containers eqiad -> codfw T128096

I've reviewed the errors from the FileOperation channel, and most seem to involve -deleted containers, which weren't covered by swiftrepl replication by default. A replication run that includes them is now running and is expected to complete soon.

Replication of -deleted containers has almost finished, though there still seem to be occasional failed sync check errors (see also https://logstash.wikimedia.org/#dashboard/temp/AVM9Mc3XO3D718AOSRyU); the last one was at 12:06 UTC, from arwiki.

I was able to delete the one on simplewiki just now. Thank you!

fgiunchedi claimed this task.

Indeed, I can't see any further errors since 12:06 UTC yesterday. I'm tentatively closing this, but please reopen if it recurs.

Reopening due to T130487 on bnwiki.

Leaving this open until we have a more permanent solution. codfw is asynchronous again, so this shouldn't impact users. The logstash dashboard for FileOperation messages that I'm looking at: https://logstash.wikimedia.org/#/dashboard/temp/AVPC3ZJZO3D718AOWzOH

This comment was removed by fgiunchedi.


Well, it's still possible, the rate just went down.

I don't know if it's relevant here, but this file is still in the cache even though it's deleted.

I'm seeing very few sync errors in the logs lately.

Change 285687 had a related patch set uploaded (by Aaron Schulz):
Set "autoResync" on for local-multiwrite

https://gerrit.wikimedia.org/r/285687

Change 285687 merged by jenkins-bot:
Set "autoResync" on for local-multiwrite

https://gerrit.wikimedia.org/r/285687
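
As a rough model of what the "autoResync" change is meant to do: before an operation, the multiwrite backend checks that its backends agree on the file, and with resync enabled it repairs the other copy from the master instead of failing. The sketch below is a simplified illustration under that assumption. The class and function names are hypothetical, and the real backend (PHP) compares more attributes and has additional modes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FileStat:
    """Hypothetical per-backend file state used for the consistency check."""
    size: int
    sha1: str


def prepare_operation(path: str,
                      master: Dict[str, FileStat],
                      clone: Dict[str, FileStat],
                      auto_resync: bool) -> str:
    """Return 'ok', 'resynced' or 'inconsistent' for one file path."""
    m: Optional[FileStat] = master.get(path)
    c: Optional[FileStat] = clone.get(path)
    if m == c:
        return "ok"  # both backends agree, the operation may proceed
    if not auto_resync:
        return "inconsistent"  # corresponds to the error in this task
    # Repair the clone from the master, which is treated as authoritative.
    if m is None:
        clone.pop(path, None)
    else:
        clone[path] = m
    return "resynced"


if __name__ == "__main__":
    master = {"d/de/Test.png": FileStat(10, "abc")}
    clone: Dict[str, FileStat] = {}
    print(prepare_operation("d/de/Test.png", master, clone, auto_resync=False))
    print(prepare_operation("d/de/Test.png", master, clone, auto_resync=True))
```

In this model, a delete that previously stopped with "inconsistent" would instead repair the second copy and continue, which matches the drop in errors reported afterwards.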

+channel:FileOperation in kibana shows only two "failed to resync" events in the last week (and such cases get fixed by swiftrepl automatically later on). This problem looks very rare now.