
Wikimedia\Rdbms\Database::selectSQLText: aggregation used with a locking SELECT (NewsletterDb::addNewsletterIssue)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error message
[deprecated]

 Wikimedia\Rdbms\Database::selectSQLText: aggregation used with a locking SELECT (NewsletterDb::addNewsletterIssue)
Stack Trace
N/A
Impact

Given it is a deprecation warning, it most likely means things are working as intended today. But this may stop working in the future.

Notes

Querying Logstash over the last 30 days shows it's been triggered every week, so this is likely not a recent regression. The detection for this issue was introduced in 2018 as part of https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/425731/.

Details

Request ID
574d4853-2114-44e7-8bb3-26c67bcb70e7
Request URL
https://www.mediawiki.org/wiki/Special:Newsletter/3/announce

Event Timeline

This is low priority as the syntax is still supported in rdbms today; however, it's unclear whether this code is using it intentionally. Part of the reason to deprecate this in rdbms was that it is often used by accident, so it would be good to check this out and make sure that's not the case here.

Migrating this code blocks removal of that trap from wikimedia-rdbms.

eprodromou added a subscriber: eprodromou.

This seems to be a potential problem, but not one we can put time into right now. Iceboxing until it becomes more urgent.

eprodromou raised the priority of this task from Lowest to Needs Triage. Jun 18 2020, 7:44 PM
eprodromou triaged this task as Lowest priority.
eprodromou moved this task from Inbox to Icebox on the Platform Engineering board.
		$lastIssueId = (int)$dbw->selectField(
			'nl_issues',
			'MAX(nli_issue_id)',
			[ 'nli_newsletter_id' => $newsletter->getId() ],
			__METHOD__,
			[ 'FOR UPDATE' ]
		);

Auto-increment could be used for nli_issue_id, but then the ID would be unique across all newsletter issues rather than per newsletter (nli_newsletter_id).
I have no idea how unique the number should/must be.

But the table also needs some investigation for T17441: Some tables lack unique or primary keys, may allow confusing duplicate data

One solution could be: no locking on the max value; try to insert, and when it fails, re-select and try again.
I don't know how locking works here in MySQL. If only the last row is locked, that is useless, as that row is not the one being updated.
If the gap after the max is locked, then it is useful for preventing another thread from inserting something.
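A rough sketch of that retry idea, assuming a unique index on (nli_newsletter_id, nli_issue_id) exists so that a colliding insert affects no rows (the field list is abbreviated and illustrative; this is not an actual patch):

```php
// Sketch only: no FOR UPDATE; tolerate races and retry on collision.
// Assumes a UNIQUE index on (nli_newsletter_id, nli_issue_id).
do {
	$lastIssueId = (int)$dbw->selectField(
		'nl_issues',
		'MAX(nli_issue_id)',
		[ 'nli_newsletter_id' => $newsletter->getId() ],
		__METHOD__
	);
	$dbw->insert(
		'nl_issues',
		[
			'nli_issue_id' => $lastIssueId + 1,
			'nli_newsletter_id' => $newsletter->getId(),
			// ... other issue fields ...
		],
		__METHOD__,
		[ 'IGNORE' ] // on duplicate key, affectedRows() stays 0 and we retry
	);
} while ( $dbw->affectedRows() === 0 );
```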

The comment for that warning says

// Some DB types (e.g. postgres) disallow FOR UPDATE with aggregate
// functions. Discourage use of such queries to encourage compatibility.

So this is a bug but not really a production bug. The fix is simply to use two separate queries.
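One way to read "two separate queries" (a hedged sketch, not the actual patch): take the locks with a plain, non-aggregating locking read, then compute the maximum in a second query:

```php
// Sketch: split the locking and the aggregation into two queries.
// 1) Lock all index entries for this newsletter; no aggregate function
//    appears in the locking SELECT, which is the part postgres disallows.
$dbw->lockForUpdate(
	'nl_issues',
	[ 'nli_newsletter_id' => $newsletter->getId() ],
	__METHOD__
);
// 2) Read the max issue ID in a second, non-locking query; the locks
//    from step 1 keep concurrent inserts out until the transaction ends.
$lastIssueId = (int)$dbw->selectField(
	'nl_issues',
	'MAX(nli_issue_id)',
	[ 'nli_newsletter_id' => $newsletter->getId() ],
	__METHOD__
);
```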

I don't know how locking works here in MySQL. If only the last row is locked, that is useless, as that row is not the one being updated.
If the gap after the max is locked, then it is useful for preventing another thread from inserting something.

The relevant index is (nli_newsletter_id, nli_issue_id) and the query condition is nli_newsletter_id = <some_newsletter_id> so the gap after (<some_newsletter_id>, <max_issue_id>) will be locked to ensure that nothing else can add/change data in the index range defined by that condition. (Or maybe MySQL can set index row locks on prefix subsets of composite indexes? Not sure, but the end result would be the same.) The locking logic is fine here, it's just that getting the max ID in the same query is a micro-optimization that only works in MySQL.

AnnaMikla added a subscriber: Art.tsymbar.

Hi, @Tgr, @tstarling! Could you please tell us whether we can just go with an 'ORDER BY nli_issue_id DESC' clause instead of 'MAX', so no aggregation is used any longer ($options['LIMIT'] = 1 is added in the Database::selectField method in any case)? Maybe we could also add an index on the 'nli_issue_id', 'nli_newsletter_id' fields. And it would be great to somehow test the execution speed. Thanks!

Hi, @Tgr, @tstarling! Could you please tell us whether we can just go with an 'ORDER BY nli_issue_id DESC' clause instead of 'MAX', so no aggregation is used any longer ($options['LIMIT'] = 1 is added in the Database::selectField method in any case)?

Yes, that should work. After you do that, please test it to confirm that duplicate issue IDs cannot be created. Add a sleep() call after the SELECT and then try to create two issues simultaneously.

Maybe we could also add an index on the 'nli_issue_id', 'nli_newsletter_id' fields. And it would be great to somehow test the execution speed. Thanks!

I think there is already an appropriate index.
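For reference, the non-aggregating variant discussed above would look something like this (an untested sketch; selectField() supplies the LIMIT 1 itself):

```php
// Same locking read, but ORDER BY ... DESC (with the implicit LIMIT 1)
// replaces MAX(), so no aggregate appears in the locking SELECT.
$lastIssueId = (int)$dbw->selectField(
	'nl_issues',
	'nli_issue_id',
	[ 'nli_newsletter_id' => $newsletter->getId() ],
	__METHOD__,
	[ 'ORDER BY' => 'nli_issue_id DESC', 'FOR UPDATE' ]
);
```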

Change 699375 had a related patch set uploaded (by Art.tsymbar; author: arttsymbar):

[mediawiki/extensions/Newsletter@master] [NOT FOR REVIEW YET]

https://gerrit.wikimedia.org/r/699375

Thanks, @tstarling for the comment!

I've tested the change with and without sleep().

  • Without sleep(): all works as expected.
  • With sleep() added: I didn't get duplicate issue IDs, but got the following error: "Wikimedia\Rdbms\DBUnexpectedError: NewsletterDb::addNewsletterIssue: Expected mass rollback of all peer transactions (DBO_TRX set)"

addnewsletter.png (483×1 px, 86 KB)

And this error happens with the current version, with MAX() usage, too. (Related to startAtomic()/endAtomic(), I suppose.)

So could you please tell us whether we should go with the ORDER BY nli_issue_id DESC change and maybe create another task to deal with such errors, or is it better to deal with them in the scope of the current issue?
Thanks!

On line 491, rollback() should be replaced with cancelAtomic(). It's not correct to call rollback() there. I'm guessing that it's catching a deadlock error, since we know that gap locking with SELECT FOR UPDATE always generates a deadlock error when it is upgraded to an exclusive lock. With cancelAtomic(), the function should return false and the error "A new issue could not be announced. Please try again." should be displayed to the user. I think that's acceptable for this use case.

Hi, @tstarling! Thanks for the comment! After replacing rollback() with cancelAtomic() as suggested, all works as expected. But when proceeding with the sleep() test I got the following error:

screen.png (650×1 px, 178 KB)
on Database::selectField() with the following query: "SELECT @@GLOBAL.read_only AS Value", so it is not related to the NewsletterDb::addNewsletterIssue() method. This could be fixed by replacing:

$status = $form->show();

with something close to:

		try {
			$status = $form->show();
		} catch ( Wikimedia\Rdbms\DBTransactionStateError $e ) {
			$status = false;

			$lb = MediaWikiServices::getInstance()->getDBLoadBalancer();
			$dbw = $lb->getConnection( DB_PRIMARY );
			$dbw->rollback( __METHOD__, Database::FLUSHING_INTERNAL );
		}

in the SpecialNewsletter::doAnnounceExecute() method.

(Tested this.) I received no exceptions while testing with sleep(), though I'm not sure if it's the best solution.

So could you please tell if we could go this way? Thanks!

@aaron can you please comment on the correct way to catch and discard a query error in an atomic section? Is it actually possible?

@aaron, never mind.

@Art.tsymbar on line 465 pass IDatabase::ATOMIC_CANCELABLE to startAtomic. The correct way to catch a query error in an atomic section is:

		$dbw->startAtomic( __METHOD__, IDatabase::ATOMIC_CANCELABLE );
		try {
			... query that may fail...
			$dbw->endAtomic( __METHOD__ );
		} catch ( DBQueryError $e ) {
			$dbw->cancelAtomic( __METHOD__ );
		}

This is documented in IDatabase::startAtomic().

Yeah, that constant causes SAVEPOINTs to be used, which add round-trip and server memory overhead but make rollback possible. The rare places using startAtomic/endAtomic that expect to catch errors and keep the prior transaction work must specify ATOMIC_CANCELABLE.

Hi @tstarling! This approach will also throw an exception, so I have changed it a little. Please check in code review.

Some pitfalls are implied by https://dev.mysql.com/doc/refman/8.0/en/innodb-error-handling.html . Statement rollbacks and ROLLBACK TO SAVEPOINT queries do not release locks due to implementation details (there is no tracking of which savepoints caused which locks). InnoDB deadlocks, and lock wait timeouts with innodb_rollback_on_timeout=ON, cause an automatic rollback of the whole transaction in order to release the locks and let other transactions progress. Rolling back the whole transaction implicitly deletes the savepoints. You might be running into that problem.

If the errors that occur are just from deadlocks, then I wouldn't bother with ATOMIC_CANCELABLE/cancelAtomic(). I would add a call to:

$dbw->lockForUpdate( 'nl_newsletters', [ 'nl_id' => $newsletter->getId() ], __METHOD__);

...in addNewsletterIssue(), just after startAtomic(). Since the row is expected to exist, that should serialize transactions trying to publish to the same newsletter, reducing deadlocks. There could still be deadlocks when adjacent newsletter IDs get concurrent transactions publishing new issues, though I'm not sure how likely that is to happen.
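Putting that suggestion together with the surrounding atomic section, addNewsletterIssue() might look roughly like this (a sketch; everything beyond the calls quoted in this task is illustrative):

```php
// Sketch of the suggested shape; not the actual patch.
$dbw->startAtomic( __METHOD__ );
// Serialize publishers of the same newsletter on its nl_newsletters row.
// The row exists even before the first issue, so this takes a plain
// record lock rather than a deadlock-prone gap lock.
$dbw->lockForUpdate( 'nl_newsletters', [ 'nl_id' => $newsletter->getId() ], __METHOD__ );
// With writers serialized above, a non-locking read is enough here.
$lastIssueId = (int)$dbw->selectField(
	'nl_issues',
	'MAX(nli_issue_id)',
	[ 'nli_newsletter_id' => $newsletter->getId() ],
	__METHOD__
);
// ... insert the new issue as $lastIssueId + 1, then:
$dbw->endAtomic( __METHOD__ );
```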

If deadlocks are still an issue with the above, a second DB connection, via CONN_TRX_AUTOCOMMIT, could be used for this method. A getLock(), keyed on nl_id, could replace the lockForUpdate() query and the FOR UPDATE option in the query to get the last issue ID.
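That fallback might be sketched like this (untested; the lock key string and timeout are illustrative assumptions):

```php
// Sketch: separate autocommit connection plus an advisory lock,
// replacing both lockForUpdate() and the FOR UPDATE query option.
$lb = MediaWikiServices::getInstance()->getDBLoadBalancer();
$dbw = $lb->getConnection( DB_PRIMARY, [], false, ILoadBalancer::CONN_TRX_AUTOCOMMIT );
$lockKey = 'newsletter-issue:' . $newsletter->getId(); // illustrative key
if ( !$dbw->getLock( $lockKey, __METHOD__, 5 ) ) {
	return false; // caller shows "A new issue could not be announced."
}
try {
	// ... SELECT MAX(nli_issue_id) without FOR UPDATE, then INSERT ...
} finally {
	$dbw->unlock( $lockKey, __METHOD__ );
}
```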

This seems to be trying to solve a similar problem as NameTableStore.

Hi @aaron! I have tested your suggested solution. It works smoothly: all announcements were sent without deadlocks or client-facing errors. Only one question: you are using the nl_newsletters table name to lock, but in the code we are inserting data into the nl_issues table, so maybe we have to lock the nl_issues table instead? Just to be sure, I tested both variants:

$dbw->lockForUpdate( 'nl_newsletters', [ 'nl_id' => $newsletter->getId() ], __METHOD__);
AND
$dbw->lockForUpdate( 'nl_issues', [ 'nli_newsletter_id' => $newsletter->getId() ], __METHOD__);

and I was surprised that both of them work.

I chose nl_newsletters since the row should exist even if there have been no newsletter issues yet for the given newsletter ID. This helps avoid deadlocks in the "first issue" case. SELECT FOR UPDATE on non-existing rows takes gap locks that are prohibitive-only (they do not grant insert permission, causing deadlocks on the actual insertion).

Hi! Thanks for the answer! Now I understand :)
I have fixed the table name, and now I think we are done.

Change 699375 merged by jenkins-bot:

[mediawiki/extensions/Newsletter@master] Remove aggregation from addNewsletterIssue method.

https://gerrit.wikimedia.org/r/699375