
Prevent creation of items having the same sitelinks (duplicates) using memcached and database locks
Closed, ResolvedPublic

Description

Problem:
Concurrent edits can create multiple items with the same sitelink, which isn’t supposed to be possible.

Suggested solution:
For each sitelink to be added, take a database lock and set a key in memcached. See T44325#6981283.

Steps to reproduce:
See T44325#6981283.

Event Timeline

Apparently prior to MySQL 5.7 / MariaDB 10.0.2, a connection can only hold one lock at a time, and a GET_LOCK() call will release the previous lock. MediaWiki core currently claims to support MySQL 5.5.8+, so I guess if someone runs Wikibase on an old MySQL version, the changes we make here won’t fully protect them from duplicate items (though they should still work for single sitelinks). We can probably live with that, though. (In production, we run a sufficiently new MariaDB version: 10.4.15, according to Special:Version.)
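
For illustration (a hedged sketch; the lock names are made up, and IDatabase::lock() is MediaWiki’s wrapper around GET_LOCK() on MySQL/MariaDB):

$dbw->lock( 'wikibase.sitelink:dewiki:Berlin', __METHOD__ );
$dbw->lock( 'wikibase.sitelink:enwiki:Berlin', __METHOD__ );
// On MySQL < 5.7 / MariaDB < 10.0.2 the second lock() call implicitly releases the
// first lock, so only the last sitelink of a multi-sitelink edit stays protected.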

Proposed procedure, refined from @hoo’s comment (a rough PHP sketch follows the list):

  • for each sitelink:
    • get the database lock, with the site + title as name
      • if we fail to acquire it, report conflict (with unknown other item)
    • getWithSetCallback the memcached “lock”, with the site + title as key, the item ID as the value, and a low TTL (60 seconds?)
      • if the returned value is not the item ID, load the other item (from the primary DB); if it indeed has a sitelink conflict, report conflict
    • on transaction commit, free the database lock
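
A rough PHP sketch of this procedure (not the actual patch; the lock/key naming scheme and the reportConflict() helper are made up for illustration):

use Wikimedia\Rdbms\IDatabase;

function checkSiteLinkConflicts( IDatabase $dbw, BagOStuff $cache, string $itemId, array $sitelinks ): void {
	foreach ( $sitelinks as $siteId => $title ) {
		// 1. database lock, named after site + title
		$lockName = "wikibase.sitelink.$siteId:$title";
		if ( !$dbw->lock( $lockName, __METHOD__, 2 ) ) {
			reportConflict( $siteId, $title, null ); // conflict with unknown other item
			continue;
		}
		// 2. memcached “lock”: site + title as key, our item ID as value, low TTL
		$key = $cache->makeKey( 'wikibase-sitelink', $siteId, $title );
		$otherId = $cache->getWithSetCallback( $key, 60, static fn () => $itemId );
		if ( $otherId !== $itemId ) {
			// load $otherId from the primary DB; report a conflict only if that item
			// really has this sitelink
			reportConflict( $siteId, $title, $otherId );
		}
		// 3. free the database lock when the main transaction commits
		$dbw->onTransactionCommitOrIdle(
			static function () use ( $dbw, $lockName ) {
				$dbw->unlock( $lockName, __METHOD__ );
			},
			__METHOD__
		);
	}
}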

Rationale:

  • The database lock is the main protection. If the connection drops while we hold the lock, MediaWiki will not automatically reconnect – so if the connection drops right after we get the lock, then the save will fail, which is good because otherwise we might save a conflicting item. However, we don’t want this to happen during the secondary data updates (where items_per_site is updated), which is why we release the lock on transaction commit.
  • The memcached “lock” bridges the gap between the commit of the main transaction and the secondary data update that writes the sitelinks to items_per_site (at which point the regular conflict detection can pick them up, see SqlSiteLinkConflictLookup). The case that two requests concurrently get the key from memcached, see that it’s free, and each write their item ID to memcached, is mostly prevented by the database lock.
  • We write the item ID to memcached, and check the mentioned item before reporting a conflict, to avoid a scenario where a sitelink is quickly added, removed, and then added to another item (we want to allow it on the other item, not detect that as a conflict).

A tiny race condition is still possible with this scheme, I believe:

  • request A gets the database lock
  • request A’s database connection dies, releasing the lock
  • request A gets the value from memcached, sees it’s empty
  • request B gets the database lock
  • request B gets the value from memcached, sees it’s empty
  • request B writes its item ID to memcached
  • request B proceeds to save the item and release the database lock
  • request A writes its item ID to memcached
  • request A tries to save the item, fails because the connection died (and won’t be automatically restored, because a lock was held)
  • request C gets the database lock
  • request C gets the value from memcached, sees request A’s item ID
  • request C tries to get request A’s item, fails because request A never managed to save its item
  • request C believes there is no conflict, saves the item
  • undetected conflict between request B and C

I can’t think of a realistic way to solve this, and I think we can probably just ignore it: in practice, it should be very unlikely. (If we didn’t write the item ID to memcached, and instead assumed that the presence of any key means a conflict, that would prevent this issue, but block more legitimate edits than we want, I believe.)

Hm, unless we really ignore the value in the memcached key, but also remove the memcached entries after successfully writing to items_per_site, as Marius said?

In that case, a failed request A as in the above example would block saves for that sitelink until the TTL expires, but that’s alright. And the “sitelink is quickly added, removed, and then added to another item” case still works, as long as the “added to another item” happens after the first request’s deferred updates have finished (and removed the sitelink from memcached). Maybe that’s actually better.

Yeah, actually, this seems nicely symmetrical. We get the database lock, and release it on main transaction commit; and we get the memcached “lock” (set the key), and “release” (delete) it on secondary updates commit.

Hm, apparently memcached even has an add command that atomically sets an entry only if it doesn’t exist. (This is exposed in BagOStuff too; in Redis it corresponds to the NX flag.) In that case, do we even need the database lock at all? Previously, the database lock prevented two requests from simultaneously getting the entry, seeing that it doesn’t exist, creating it, and then both successfully saving (with the lock, that sequence is only possible if one request drops the lock, at which point it will no longer successfully save). But with add it seems like we might not need that at all?

(Of course, it’s possible that the memcached entry is evicted for whatever reason, but that’s true regardless of the database lock, and the database lock also doesn’t protect against this if the eviction happens after we release the lock on main transaction commit.)
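
As a hedged sketch (key name and TTL illustrative), the add()-based variant would look roughly like this:

// BagOStuff::add() sets the key only if it doesn’t exist yet and returns false otherwise
// (memcached “add”, Redis SET NX).
$key = $cache->makeKey( 'wikibase-sitelink', $siteId, $title );
if ( !$cache->add( $key, $itemId, 60 ) ) {
	// some other request holds the “lock” (or recently saved this sitelink): report a conflict
}
// … later, once the secondary data update has written the sitelink to items_per_site:
$cache->delete( $key );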

Hmm, but if I wire this up with SiteLinkUniquenessValidator (as I had intended), then this will always be used to check all sitelinks of an item, not just added or modified sitelinks. And this would mean that, as long as an item has any sitelinks at all, a second edit to the item would only be possible after the first edit’s secondary data updates have finished. While the item has been edited but the secondary updates haven’t finished yet, the keys for all its sitelinks will exist in memcached, and any other edit to the item will be detected as a conflict (because we’re not using the values of those keys, we have no idea that it’s a “self-conflict”).

Is this a problem? Is it usual for another edit to an item to start before the secondary updates have finished?

And if it is a problem… do we need to go back to the version again where we write the item ID to the memcached key?

Hm, if we write the item ID to memcached, but assume that any other item ID means a conflict (rather than trying to load that item and making sure), then we avoid that “tiny race condition” from earlier. This should be more okay now than in the proposal of that comment, because now we’re removing the memcached entry when the secondary data update finishes, so it’s okay that we only use the item ID to prevent self-conflicts.
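
In sketch form (reportConflict() again being a stand-in), that check would then become:

if ( !$cache->add( $key, $itemId, 60 ) ) {
	$otherId = $cache->get( $key );
	if ( $otherId !== false && $otherId !== $itemId ) {
		// any other item ID means a conflict, without loading that item;
		// our own item ID just means our previous edit’s secondary updates haven’t finished yet
		reportConflict( $siteId, $title, $otherId );
	}
}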

Change 722410 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] WIP: Prevent creation of true duplicates

https://gerrit.wikimedia.org/r/722410

So the above change doesn’t work locally, and I think the reason is the database isolation level. I assumed earlier that it’s safe to delete the key from memcached after the transaction writing to items_per_site has committed – but this assumes that the other transaction (trying to find conflicts in items_per_site) will see committed rows (READ COMMITTED). At the default REPEATABLE READ level, if that transaction started before the writeback committed, the conflict won’t be seen in the database.

Yeah, the transaction isolation is the problem. The conflict is detected when I use

// in SqlSiteLinkConflictLookup::getConflictsForItem():
// $dbr = $this->db->connections()->getWriteConnectionRef();
$dbr = $this->db->connections()->getWriteConnection( ILoadBalancer::CONN_TRX_AUTOCOMMIT );

to get a separate connection without a running transaction.
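
For illustration (simplified, not the exact lookup code; field names as in Wikibase’s wb_items_per_site table): on the CONN_TRX_AUTOCOMMIT connection, each query sees the latest committed rows instead of the REPEATABLE READ snapshot taken by the request’s main transaction:

$conflictingItemIds = $dbr->selectFieldValues(
	'wb_items_per_site',
	'ips_item_id',
	[ 'ips_site_id' => $siteId, 'ips_site_page' => $title ],
	__METHOD__
);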

Here’s a version of the reproduction commands from T44325#7355099 for easier testing: first, setup once…

const params = {
	action: 'wbeditentity',
	new: 'item',
	data: JSON.stringify( {
		sitelinks: {
			[ mw.config.get( 'wgDBname' ) ]: {
				site: mw.config.get( 'wgDBname' ),
				title: mw.config.get( 'wgTitle' )
			}
		}
	} )
};
const api = new mw.Api();

and then repeat the following snippet as often as you want:

await Promise.all( [ api.postWithEditToken( params ), api.postWithEditToken( params ) ].map( promise => promise.then( response => response.entity.id ).then( itemId => {
  console.log( itemId );
  return api.postWithEditToken( { action: 'delete', title: `Item:${itemId}` } );
} ) ) )

This will try to create two items with the same sitelink, and either log a success (if both were created, i.e. the conflict was not detected, bad) or an error (failed-save indicates the conflict was detected, good). Either way, any created item is also deleted afterwards, so you can immediately try again.

Using CONN_TRX_AUTOCOMMIT for the conflict lookup has an unexpected side effect, though. I just got this error message from the API:

The link <a rel="nofollow" class="external text" href="http://localhost/wiki1/index.php/True_duplicate">wiki1:True duplicate</a> is already used by Item <a href="/wiki1/index.php?title=Item:Q2546&action=edit&redlink=1" class="new" title="Item:Q2546 (page does not exist)">Q2546</a>. You may remove it from <a href="/wiki1/index.php?title=Item:Q2546&action=edit&redlink=1" class="new" title="Item:Q2546 (page does not exist)">Q2546</a> if it does not belong there or merge the Items if they are about the exact same topic.

The conflict lookup correctly detected the conflict with Q2546, but the rest of MediaWiki was working in a transaction that started before that item had been committed, so it thought the item didn’t exist and the link became a redlink. I don’t think there’s anything we can do about this – the message is rendered wikitext, the link is a wikilink, we can’t just replace a call to LinkRenderer::makeLink() with makeKnownLink() and call it a day.

On the other hand, I’m pretty sure this error was already possible (albeit unlikely) before: if request A deletes an item, commits, request B starts with an attempt to create another item with the same sitelink, and request A’s secondary data updates only run afterwards (deleting the item’s sitelinks from the database), then request B will still detect a conflict, but will already see that item as deleted, so it should render a redlink as well.

Change 722924 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup

https://gerrit.wikimedia.org/r/722924

Change 722925 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Allow unknown item ID in SiteLinkConflictLookup

https://gerrit.wikimedia.org/r/722925

Change 722924 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup

https://gerrit.wikimedia.org/r/722924

Change 722925 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Allow unknown item ID in SiteLinkConflictLookup

https://gerrit.wikimedia.org/r/722925

Change 724370 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@wmf/1.38.0-wmf.1] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup

https://gerrit.wikimedia.org/r/724370

Change 724371 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@wmf/1.38.0-wmf.2] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup

https://gerrit.wikimedia.org/r/724371

Change 724370 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.38.0-wmf.1] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup

https://gerrit.wikimedia.org/r/724370

Mentioned in SAL (#wikimedia-operations) [2021-09-28T11:54:22Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.38.0-wmf.1/extensions/Wikibase/repo/includes/Store/Sql/SqlSiteLinkConflictLookup.php: Backport: [[gerrit:724370|Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup (T291377)]] (duration: 00m 57s)

The wmf.1 backport seems to work well on Test Wikidata at the moment, I tried it on https://test.wikidata.org/wiki/Wikidata:T291377_test_page. The wmf.2 backport is on hold, see T281166#7383319. But until the train reaches group1 tomorrow evening, true duplicates should now be less likely on Wikidata.

Change 724371 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.38.0-wmf.2] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup

https://gerrit.wikimedia.org/r/724371

Mentioned in SAL (#wikimedia-operations) [2021-09-29T11:43:58Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.38.0-wmf.2/extensions/Wikibase/repo/includes/Store/Sql/SqlSiteLinkConflictLookup.php: Backport: [[gerrit:724371|Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup (T291377)]] (duration: 01m 07s)

Change 722410 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Prevent creation of true duplicates

https://gerrit.wikimedia.org/r/722410