Page MenuHomePhabricator

Protect page actions with advisory locks
Open, MediumPublic13 Estimated Story Points

Description

We should use advisory locks to prevent page actions such as deletions, moves, and imports (and maybe also edit) from interfering with each other. This will allow us to present a meaningful message to the user if an action cannot be performed because another action is ongoing. Currently, this situation often leads to backend errors doe to database lock timeouts or to client side browser timeouts. It canalso lead to confusion for the user, when a page that they just moved just vanishes immediately because someone else deleted it at the same time.

Background:
This came out of an analysis of the cause for T382699: Page Deletion Fatal exception of type "Wikimedia\Rdbms\DBQueryError" due to lock timeout:

Currently, the critical section of a page operation will holds a lock on the relevant database tables/rows. Another action trying to update these rows will be blocked, and that block may eventually time out, on the server or on the client side. In the case of T382699, the browser timed out on the original deletion request, so the user tried again, but failed to get a database lock, which led to a fatal error for the second request. The original request eventually completed successfully.

Using an advisory lock that can be tested in a non-blocking way would allow us to immedately inform the user of the situation.

Implementation notes:

  • Locking should be performed by "command" classes like MovePage, DeletePage, etc.
  • Locks should be obtained from a PageLockManager service.
  • Care must be taken that the lock is freed eventually if the process performing the action terminates unexpectedly.
  • We could use one of the LockManager implementation we also use in FileBackend.
  • Alternatively, we could use PoolCounter, which suppotrs automatic lock release. We could also create a LockManager implementation based on PoolCounter.

Event Timeline

Using PageLockManager to get DB locks (GET_LOCK,0) is kind of nice since:

  • No extra TCP connections, since it's the same TCP connection as the relevant DB server, which we already need to connect to anyway.
  • By using the same connection for locks and DB writes, we don't need need high lock timeouts (upper bound of the critical section duration) or fencing tokens to avoid *concurrent* critical sections for DB writes. High timeouts cause objects to be unwritable for a long time if something fails. Fencing tokens still permit *incomplete* critical sections when there are multiple data stores (e.g. DB + ExternalStore + swift), though that's already a risk with multiple data stores (unless mitigated by some pattern of immutable keys (ExternalStore, swift objects with UUID/sha256 keys) outbox + idempotence, or outbox + persistent locks).
  • The locks can automatically release on disconnect from the lock service.

They'd be non-blocking, so T366938 shouldn't be an issue.