Page MenuHomePhabricator

DiscussionTools: LogicException: Database can't find our row and won't let us insert it
Closed, ResolvedPublicPRODUCTION ERROR

Description

https://logstash.wikimedia.org/goto/6bdbd30f2850920e6e76d457e84677f3

Error
trace
[c84e88de4b6a110a586204e8] [no req]   LogicException: Database can't find our row and won't let us insert it
Backtrace:
from /srv/mediawiki/php-1.41.0-wmf.20/extensions/DiscussionTools/includes/ThreadItemStore.php(364)
#0 /srv/mediawiki/php-1.41.0-wmf.20/extensions/DiscussionTools/includes/ThreadItemStore.php(599): MediaWiki\Extension\DiscussionTools\ThreadItemStore->findOrInsertIdButTryHarder(Closure, Closure, boolean)
#1 /srv/mediawiki/php-1.41.0-wmf.20/includes/libs/rdbms/database/Database.php(2449): MediaWiki\Extension\DiscussionTools\ThreadItemStore->MediaWiki\Extension\DiscussionTools\{closure}(Wikimedia\Rdbms\DatabaseMysqli, string)
#2 /srv/mediawiki/php-1.41.0-wmf.20/includes/libs/rdbms/database/DBConnRef.php(119): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure, string)
#3 /srv/mediawiki/php-1.41.0-wmf.20/includes/libs/rdbms/database/DBConnRef.php(663): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#4 /srv/mediawiki/php-1.41.0-wmf.20/extensions/DiscussionTools/includes/ThreadItemStore.php(623): Wikimedia\Rdbms\DBConnRef->doAtomicSection(string, Closure, string)
#5 /srv/mediawiki/php-1.41.0-wmf.20/extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php(209): MediaWiki\Extension\DiscussionTools\ThreadItemStore->insertThreadItems(MediaWiki\Revision\RevisionStoreRecord, MediaWiki\Extension\DiscussionTools\ContentThreadItemSet)
#6 /srv/mediawiki/php-1.41.0-wmf.20/extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php(169): MediaWiki\Extension\DiscussionTools\Maintenance\PersistRevisionThreadItems->processRow(stdClass)
#7 /srv/mediawiki/php-1.41.0-wmf.20/extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php(119): MediaWiki\Extension\DiscussionTools\Maintenance\PersistRevisionThreadItems->process(Wikimedia\Rdbms\SelectQueryBuilder, array)
#8 /srv/mediawiki/php-1.41.0-wmf.20/maintenance/includes/MaintenanceRunner.php(685): MediaWiki\Extension\DiscussionTools\Maintenance\PersistRevisionThreadItems->execute()
#9 /srv/mediawiki/php-1.41.0-wmf.20/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#10 /srv/mediawiki/multiversion/MWScript.php(159): require_once(string)
#11 {main}
frwiki:  Error while processing revid=168268309, pageid=231176
Notes

Happened as part of T315510: Start maintenance script to backfill talk page comment database maintenance script run. Tried to ping @matmarex as the requesting developer via IRC, but was late in the evening, so filling as a task to ensure this is looked at / triaged in some way.

Event Timeline

It looks like it's actually from the https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/945803 deployment yesterday for T323080 / T341811. There are a few hundred occurrences since yesterday, seems like slightly lower volume than the previous errors: https://logstash.wikimedia.org/goto/6bdbd30f2850920e6e76d457e84677f3

image.png (521×2 px, 109 KB)

I thought this error would be impossible, so I wrote a silly error message for it. @Ladsgroup Can you help investigate?

Mentioned in SAL (#wikimedia-operations) [2023-08-08T22:33:07Z] <urbanecm> mwmaint1002: stop persistRevisionThreadItems.php frwiki instance because of T343859 (cc T315510)

From -operations discussion:

22:58 <urbanecm> MatmaRex: are you still here by any chance?
22:58 <urbanecm> i got this error with the script https://www.irccloud.com/pastebin/J1r6BZCe/
23:30 <MatmaRex> urbanecm: sorry, was away for a bit. hopefully it's not very common?
23:31 <MatmaRex> it's a new error from new code. it should be impossible :) so i'll ask amir to look at it
23:32 <MatmaRex> looks like it's from the last deployment, not from the maintenance script. a few hundred since yesterday

i see it in my terminal scrollback, so it definitely comes from the script as well (of course might be caused by the code you linked, but the script errors out as well).

it started to happen fairly regularly for frwiki only (other instances of the script seem to be intact), so i stopped it for now.

I've said it before: Very likely the only reason we are having all of this issues is that jobs or something else are running the DT updates twice in the same context and it's conflicting with itself. If you find the root cause of that and fix it, you automatically fix this and the deadlock issue as well.

ppelberg moved this task from Incoming to Doing on the Editing-team (Kanban Board) board.
ppelberg subscribed.

@matmarex I'm boldly assigning this to you assuming that you're working on this. Tho, please unassign yourself if this is not the case :)

I've said it before: Very likely the only reason we are having all of this issues is that jobs or something else are running the DT updates twice in the same context and it's conflicting with itself. If you find the root cause of that and fix it, you automatically fix this and the deadlock issue as well.

I guess I'll be looking into it. I have a few other projects going right now, but maybe I'll start next week. I hope the amount of errors being logged is not too annoying for now.

Mentioned in SAL (#wikimedia-operations) [2023-08-22T08:31:07Z] <urbanecm> mwmaint1002: Stop frwiki instance of T315510 scripts due to a large volume of T343859 errors

This is still seen in production. About 700 entries per day, across multiple wikis such as itwiki, commonswiki and svwiki. They're all from jobrunner hosts in DT code during a RefreshLinksJob job.

LogicException: Database can't find our row and won't let us insert it

I've been looking into this on-an-off, but it's not my top priority.

There are actually very few wikis that seriously suffer from this, with most errors occurring on commonswiki, and wikidatawiki and frwiki not far behind. I made a pretty chart: https://logstash.wikimedia.org/goto/d1b607669372043884b3bc9ff5071ccc

image.png (1×2 px, 170 KB)

I found T346383: Edits to file description pages schedule recursive refreshLinks jobs unnecessarily while looking into this, and I think it's a significant factor, at least on commonswiki. I'll try to do something about that and see if it helps.

Apart from that, I think the next step is to add more debug logging, to figure out whether the cause is trying to update the same pages repeatedly, or just trying to update too many pages. Then maybe we can reduce the frequency of updates or something.

Change 957789 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] ThreadItemStore: Add details to row insertion exceptions

https://gerrit.wikimedia.org/r/957789

Change 957400 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@wmf/1.41.0-wmf.26] ThreadItemStore: Add details to row insertion exceptions

https://gerrit.wikimedia.org/r/957400

Change 957789 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] ThreadItemStore: Add details to row insertion exceptions

https://gerrit.wikimedia.org/r/957789

Change 957400 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.41.0-wmf.26] ThreadItemStore: Add details to row insertion exceptions

https://gerrit.wikimedia.org/r/957400

Mentioned in SAL (#wikimedia-operations) [2023-09-14T20:32:25Z] <thcipriani@deploy1002> Started scap: Backport for [[gerrit:957399|Don't offer visual diffs for non-wikitext pages (T346252)]], [[gerrit:957400|ThreadItemStore: Add details to row insertion exceptions (T343859)]]

Mentioned in SAL (#wikimedia-operations) [2023-09-14T20:33:57Z] <thcipriani@deploy1002> thcipriani and matmarex: Backport for [[gerrit:957399|Don't offer visual diffs for non-wikitext pages (T346252)]], [[gerrit:957400|ThreadItemStore: Add details to row insertion exceptions (T343859)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD

Mentioned in SAL (#wikimedia-operations) [2023-09-14T20:45:01Z] <thcipriani@deploy1002> Finished scap: Backport for [[gerrit:957399|Don't offer visual diffs for non-wikitext pages (T346252)]], [[gerrit:957400|ThreadItemStore: Add details to row insertion exceptions (T343859)]] (duration: 12m 35s)

So I'm seeing fun things in the new logging. For example, https://it.wikipedia.org/wiki/Wikipedia:Bar/Discussioni/Interlink_di_nuovo_maiuscoli (a tiny archived discussion with no edits in the last 10 years) has caused this error 63 times in the last week, every couple of hours (https://logstash.wikimedia.org/goto/da6dae1e6e3faff37ea812b0e8dc1ac5).

First, why is it getting refreshLinks processing all the time? Because it technically transcludes https://it.wikipedia.org/wiki/Aiuto:Sportello_informazioni, which is a huge and active discussion page, due to this module: https://it.wikipedia.org/w/index.php?title=Modulo:Vedi_anche&oldid=130657660#L-31 looking up whether that page's title should be shown with a lowercase first letter. Sigh.

Second, why is it failing? Surely there's no way this is a deadlock issue. I had a look at the database, and it looks like it's affected by T339882 – there are rows with itr_itemid_id=0 or itr_items_id=0. We closed that task thinking that these rows should automatically fix themselves, rather than cause new issues, so I need to re-investigate that.

So the thinking was that rows with corrupted fields would be ignored by everything, and new fixed rows would be inserted automatically. This indeed works for corrupted itp_items_id or itr_itemid_id. However, for rows with corrupted itr_items_id, it doesn't, because this field is not in the unique index on that table. That's why we get a duplicate key problem – we try to insert a fixed row, identical except for itr_items_id, and violating the unique index. I think we can just update them on conflicts.

Change 959370 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] ThreadItemStore: Fix rows with itr_items_id=0 corrupted by T339882

https://gerrit.wikimedia.org/r/959370

I tested this locally by manually corrupting a database row in this manner, reproducing the exception, applying the patch, and then seeing the exception gone and the row fixing itself.

Change 959370 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] ThreadItemStore: Fix rows with itr_items_id=0 corrupted by T339882

https://gerrit.wikimedia.org/r/959370

matmarex added a project: Skipped QA.

That seems to have done it – there have been no more errors since the wmf.30 rollout earlier today!