Description
Details
Related Objects
- Mentioned Here
- T236466: PHP Warning: [data-update-failed]: A data update callback triggered an exception (Wikimedia\Rdbms\Database::makeList: empty input for field wbxl_text_id) [Called from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate in /extensions/Wikibase/repo/includes/Content/DataUpdateAdapter.php at line 62]
T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction
Event Timeline
Relates probably to things that were backported in T234948.
The maintenance script for the terms migration restarted at around the time this issue seemed to appear.
5:59 PM <jynus> !log killed rebuildItemTerms on mwmaint1002
5:59 PM <+stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
@Ladsgroup will write a patch removing the cron for now so that the script doesn't start up again.
Change 547264 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] mediawiki: Temporary disable rebuildTermIndex
Change 547264 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: Temporary disable rebuildTermIndex
Edit rate recovered, then got worse again
We are seeing 5-15s lag on db slaves.
Going to revert the 2 backports
Mentioned in SAL (#wikimedia-operations) [2019-10-30T17:26:36Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Revert 16:02 UTC T236928 (duration: 01m 04s)
Mentioned in SAL (#wikimedia-operations) [2019-10-30T17:29:15Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Revert 16:05 UTC T236928 (duration: 01m 05s)
The replag all seemed to come from one host which was having issues
That seems to have recovered now.
Edits seem to be recovering
From @jcrespo
Rows read on s8 shot up from around 16:12
The original code deploys were:
16:05 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547244|Stop locking and use DISTINCT when finding used terms to delete]] (T234948) (duration: 01m 04s)
16:02 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547243|Stop locking and use DISTINCT when finding used terms to delete]] (T236466) (duration: 01m 05s)
The maintenance script restarted every hour at XX:30 minutes according to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/547264/1/modules/mediawiki/manifests/maintenance/wikidata.pp
So the issue existed since the deployment, and the maintenance script just increased the issue to a visible level.
We are seeing 5-15s lag on db slaves.
Offtopic- I don't believe the lag was real- it just that there was so much overload that lag checking reads/connections by mw would take so much time to finish and return results. This is irrelevant for this ticket (the state was bad), but something that maybe could be optimized or could be missleading in other scenarios in the lag checking code?
The maintenance script restarted every 30 mins
Small correction, every hour at XX:30 minutes. Maybe you already knew that, but just in case it could lead to missleading analysis.
Change 547397 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Revert "Stop locking and use DISTINCT when finding used terms to delete"
Change 547397 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Revert "Stop locking and use DISTINCT when finding used terms to delete"
Created a draft incident documentation https://wikitech.wikimedia.org/wiki/Incident_documentation/20191031-wikidata
Incident doc draft up, and the patch involved has been reverted on master so as not to surprise us next week.
So closing this now