
"Could not acquire lock" error when publishing larger files
Open, Needs Triage | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Upload a larger file (>40 MiB) via UploadWizard or Chunked Upload and try to publish it
  • The error has been appearing for several days (possibly even weeks)
  • The error appears frequently

What happens?:
The error "Konnte keine Sperre erhalten. Jemand anderes macht etwas mit dieser Datei." (English: "Could not get a lock. Someone else is doing something with this file.") appears, sometimes as part of different error messages.

What should have happened instead?:
The file should be uploaded as desired

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):
Wikimedia Commons

Other information (browser name/version, screenshots, etc.):

Screenshot 2025-02-17 150843.png (400×1 px, 202 KB)

Event Timeline

A_smart_kitten subscribed.

Tentatively adding to the SRE-swift-storage queue in case they can determine what went wrong here, feel free to untag/re-tag as desired though :)

I'm afraid "could not acquire lock" is not an error message that Swift would produce, so I don't think there's anything I can do to help here.

The error message itself appears to be produced by a MediaWiki lock manager, but my reason for tagging SRE-swift-storage was in case Swift had done anything on the back-end that might have caused MW to produce this message. Hmm. I wonder if there are logs somewhere that would help to get to the root cause here.

We've not had any spikes in errors from Swift recently, so I doubt Swift is to blame here; and I'm afraid we don't keep logs on the front-ends for more than a few days, so I no longer have them from 17th Feb.

@PantheraLeo1359531, has the error occurred for you again in the last few days? If it has, do you know roughly what date and time it occurred? (I don't have access to the logs, so I wouldn't be the person trying to diagnose the issue here, but this information may be helpful to someone who might be.)

Hi!
As far as I know, it doesn't depend much on the time of day, and it has also happened in the last few days. I remember it happening around 9 PM (CET), but also some hours earlier. This is usually my "working time" on Commons.

Screenshot 2025-03-08 185609.png (479×1 px, 157 KB)

Still happening, regardless of which upload tool I use.

I'm sorry, that must be annoying. I'm afraid from a Swift perspective I don't have anything I could go on to try and find the issue (and given our current monitoring, I think it's unlikely to be Swift at fault); with a large video file it might be a thumbnailing problem, or it could be something else in the stack getting confused.

Upload and try to publish a larger file >40 MiB

What's the total size of the file?

It would be helpful if we had details of the date and time at which an error occurred, so that we can locate any relevant log files as well as events around that time.

Yes, thank you for that hint; I'll try to collect the information.

Here is another one: 378 MiB, occurred at 11:22 AM (UTC+0)

Screenshot 2025-03-11 122537.png (314×1 px, 25 KB)

One of the problems here is that the proposed initial filename isn't reaching Swift: if I run "zgrep -cF 'DaySkyHDR1050B_16K-HDR.tif' /var/log/swift/proxy-access.log.1.gz" on all 12 frontend servers, I get 0 hits.

Also, each frontend server handles about 47k requests per minute, so I can't plausibly inspect the logs from around the time things went wrong by hand. I think this really needs investigating higher up the stack.
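For anyone repeating the per-server check above somewhere zgrep isn't handy, a minimal Python sketch of the same fixed-string line count over a gzip-compressed log might look like this. The function name count_fixed_string is made up for illustration; it assumes only that the log is a gzip file with one entry per line, and implies nothing about Swift's actual log format.

```python
import gzip


def count_fixed_string(path: str, needle: str) -> int:
    """Count lines in a gzip-compressed file that contain `needle` literally.

    Roughly what `zgrep -cF needle path` reports: no regex interpretation,
    one count per matching line (not per occurrence).
    """
    needle_bytes = needle.encode("utf-8")
    with gzip.open(path, "rb") as log:
        # Iterating a binary file object yields one line at a time,
        # so large logs are never loaded into memory at once.
        return sum(1 for line in log if needle_bytes in line)
```

A result of 0 on every server would correspond to the "0 hits" reported above.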

I might be reading the code wrong, but it looks like the lock managers used by the file backends don't set any TTL (I checked FileBackend::getScopedFileLocks). That means that if a file gets locked and something breaks in the meantime (Redis issues, a bug in MW that leaves the lock unreleased on failure, etc.), no one can ever upload the same file again. May I suggest flushing old entries in Redis to see if that fixes the issue? (I found this on the internet: https://stackoverflow.com/questions/16517439/redis-how-to-delete-all-keys-older-than-3-months) We should also add a reasonably long TTL for locks (1 day? 1 week?). Note that the lock functions do take a $timeout, but that is the timeout for acquiring the lock, not a TTL.
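To illustrate the distinction drawn above between an acquisition timeout and a lock TTL, here is a toy in-memory sketch. It is not MediaWiki's lock manager and not Redis; the class TtlLockStore and its methods are hypothetical names, and the injectable clock and sleep exist only to make the behaviour easy to demonstrate.

```python
import time
from typing import Callable, Dict, Optional


class TtlLockStore:
    """Toy lock table (hypothetical; for illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expiry: Dict[str, float] = {}  # key -> time at which the lock lapses

    def try_lock(self, key: str, ttl: Optional[float]) -> bool:
        """Take the lock if it is free or has expired; never waits."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # still held by someone else
        # ttl=None models a lock with no expiry at all.
        self._expiry[key] = float("inf") if ttl is None else now + ttl
        return True

    def unlock(self, key: str) -> None:
        """Release the lock; a crashed holder would never reach this call."""
        self._expiry.pop(key, None)

    def acquire(
        self,
        key: str,
        ttl: Optional[float],
        timeout: float,
        poll: float = 0.05,
        sleep: Callable[[float], None] = time.sleep,
    ) -> bool:
        """Retry try_lock until it succeeds or `timeout` seconds have passed.

        `timeout` only bounds how long the caller waits; it has no effect
        on how long an already-held lock stays in place.
        """
        deadline = self._clock() + timeout
        while True:
            if self.try_lock(key, ttl):
                return True
            if self._clock() >= deadline:
                return False
            sleep(poll)
```

With ttl=None, a lock that is never released makes every later acquire() fail once its timeout runs out, which matches the failure mode described above; with a finite ttl the stale lock lapses on its own. In a real Redis deployment the equivalent would presumably be an expiry set atomically with the lock (for example via the NX and PX options of SET), but whether that fits the existing lock manager is an open question.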

Random note:

Let's please add some prefix to the locks:

127.0.0.1:6381> RANDOMKEY
"0de65intri8ng7cmguqlhk560rl1ulz"
127.0.0.1:6381> RANDOMKEY
"o455wl443nxe5dc90wtgaec7ifzyrt4"
127.0.0.1:6381> RANDOMKEY
"qod6ej29qk76ccd9ujrtqsaobiucjwr"
127.0.0.1:6381> RANDOMKEY
(nil)
127.0.0.1:6381> RANDOMKEY
"rxsbwa9wktis55nyin46qxu7q9kja5o"