
[REPO][SW] DatabaseTermStoreWriterBase::acquireAndInsertTerms() shouldn't lock all rows for the item
Closed, ResolvedPublic

Description

Currently, one of the biggest transaction time warnings we have for production is this:

SELECT wbit_term_in_lang_id AS `value` FROM `wbt_item_terms` WHERE wbit_item_id = N FOR UPDATE

(see https://logstash.wikimedia.org/goto/3e08a3ceb6d8352b13217ff6862cbfe7)

Given that this table is secondary data with the ability to self heal after reparse, and especially since the wbt_item_terms table does not hold the dictionary used as a foreign key target (unlike wbt_text), that SELECT ... FOR UPDATE seems to be a scalability issue, doubly so given the rate of edits on Wikidata. Can we just remove it?
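For illustration, this is the shape of the change being asked for: the first statement is the query quoted above (with a placeholder item id), the second is the same read without the locking clause.

-- Current behaviour: read the existing term_in_lang ids for the item and
-- lock those rows for the rest of the transaction, serializing concurrent
-- edits to the same item.
SELECT wbit_term_in_lang_id AS `value`
FROM `wbt_item_terms`
WHERE wbit_item_id = 1234
FOR UPDATE;

-- Proposed: the same read without FOR UPDATE. Concurrent writers may then
-- race, but the data is secondary and a later reparse can repair it.
SELECT wbit_term_in_lang_id AS `value`
FROM `wbt_item_terms`
WHERE wbit_item_id = 1234;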

Event Timeline

Lucas_Werkmeister_WMDE renamed this task from DatabaseTermStoreWriterBase::acquireAndInsertTerms() shouldn't lock all rows for the item to [REPO][SW] DatabaseTermStoreWriterBase::acquireAndInsertTerms() shouldn't lock all rows for the item. May 16 2024, 12:02 PM

Prio Notes:

Impact Area | Affected
production / end users
monitoring
development efforts
onboarding efforts
additional stakeholders

Prio note: We’re prioritizing this above T362084, despite the lower “score”, because a database performance issue seems more important to us.

Given that this table is secondary data with the ability to self heal after reparse

I’m skeptical of this proclaimed ability; one important gap I’m aware of is that, as far as I know, we never remove “old” unused text entries from the storage after the fact – we rely on them being removed at the moment they become unused (which requires some locking tricks to make sure nothing breaks if a concurrent transaction re-uses the text entry). IMHO this needs more investigation before we can drop the locking.
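To illustrate the kind of race those locking tricks guard against, here is a hypothetical interleaving sketched in SQL. The column names follow the wbt_* term store schema as I understand it, and the ids and the interleaving are made up for illustration, not taken from the actual cleanup code.

-- Session A (cleanup): checks whether text row 42 is still unused.
SELECT wbx_id
FROM wbt_text
WHERE wbx_id = 42
  AND NOT EXISTS (SELECT 1 FROM wbt_text_in_lang WHERE wbxl_text_id = 42);

-- Session B (term writer): concurrently re-uses the same text row for a
-- newly acquired term, adding a fresh reference to wbx_id 42.
INSERT INTO wbt_text_in_lang (wbxl_language, wbxl_text_id)
VALUES ('en', 42);

-- Session A: deletes the row it still believes to be unused, leaving
-- Session B's new reference dangling. Preventing this is what the locking
-- around the cleanup path is for.
DELETE FROM wbt_text
WHERE wbx_id = 42;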

It’s possible that we should change the approach we take to ensure people can’t see deleted terms in the cloud replicas / on Quarry (which is the main reason we care about cleaning them up, I think: we don’t want revdeleted labels with PII to stay visible there). Currently, we try pretty hard to actually delete the terms from the secondary storage, but this causes all sorts of locking headaches (T283198) and occasionally missing data (T309445). It might be better to never¹ delete terms from the primary storage, and instead just change the cloud replica views to only return data that’s referenced (recursively) from wbt_item_terms or wbt_property_terms – shifting the responsibility of hiding sensitive data from the production databases to the cloud replicas. (I think I’ve commented something like this on another task before, but I can’t find it now.)

¹ (Or if we’re worried about the size of the secondary storage exploding – somehow still delete terms, but only if it seems like they’re not used and won’t be used. Maybe a job can scan for unused IDs, submit them to a second job to run a week later, and the second job actually deletes them if they’re still unused?)
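For what it’s worth, a rough SQL sketch of the replica-view idea above, plus the footnote’s “delete only if still unused later” re-check. The view name wbt_text_visible is made up, and the column names are taken from the wbt_* term store schema as I understand it; this is not an existing view definition.

-- Hypothetical replica view that only exposes text rows still reachable
-- from wbt_item_terms or wbt_property_terms; anything orphaned by a
-- revision deletion would simply stop showing up on the cloud replicas.
CREATE VIEW wbt_text_visible AS
SELECT wbx_id, wbx_text
FROM wbt_text
WHERE wbx_id IN (
    SELECT wbxl_text_id
    FROM wbt_text_in_lang
    JOIN wbt_term_in_lang ON wbtl_text_in_lang_id = wbxl_id
    WHERE wbtl_id IN (SELECT wbit_term_in_lang_id FROM wbt_item_terms)
       OR wbtl_id IN (SELECT wbpt_term_in_lang_id FROM wbt_property_terms)
);

-- Footnote idea: the second job re-checks that an id flagged a week ago
-- is still unused before actually deleting it (42 is a placeholder id).
DELETE FROM wbt_text
WHERE wbx_id = 42
  AND NOT EXISTS (SELECT 1 FROM wbt_text_in_lang WHERE wbxl_text_id = 42);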

I agree that the pruning part needs to be rethought and revamped but I'm failing to see how it's connected to this problem. This is explicitly about locking the rows in wbt_item_terms which is the head. Let's say for Q1234 you have this:

+------------+--------------+----------------------+
| wbit_id    | wbit_item_id | wbit_term_in_lang_id |
+------------+--------------+----------------------+
|  767752395 |         1234 |                    5 |
| 6682347558 |         1234 |                  172 |
|   95594787 |         1234 |                  192 |
|   34102141 |         1234 |                  237 |
|  561818268 |         1234 |                  274 |
+------------+--------------+----------------------+

If you don't lock the rows in wbt_item_terms, and all of these rows get corrupted or deleted (and, as a result, entries in wbt_term_in_lang and even wbt_text_in_lang get deleted too), a reparse should fix all of this and reinsert any missing entries in all of the tables, including anything deleted by the prune job. In other words, even if race conditions on the deletion job corrupt the data from the head down to the leaf, it should still recover with a reparse, from my understanding of how it is built. What am I missing here?
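To make that concrete, here is a rough SQL sketch of what a reparse effectively does for the head table, using the ids from the example above. The real logic lives in the PHP writer and also re-acquires the lower tables first, so the term_in_lang ids exist again before the head rows are written.

-- The reparse re-derives the full set of terms for Q1234 from the entity
-- data; after re-acquiring wbt_text, wbt_text_in_lang and wbt_term_in_lang,
-- any missing head rows are simply inserted again, e.g. if all of them
-- were lost:
INSERT INTO wbt_item_terms (wbit_item_id, wbit_term_in_lang_id)
VALUES (1234, 5), (1234, 172), (1234, 192), (1234, 237), (1234, 274);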

Also noting that I'm talking about the lock in ::acquireAndInsertTerms() and not the one in ::deleteTermsWithoutClean(), as the latter happens much less frequently and doesn't cause lock contention from what I'm seeing.

Change #1050658 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Wikibase@master] Don't use FOR UPDATE in acquireAndInsertTerms

https://gerrit.wikimedia.org/r/1050658

Change #1054359 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] DatabaseTermStoreWriterBase: Clarify best effort basis

https://gerrit.wikimedia.org/r/1054359

Change #1050658 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Don't use FOR UPDATE in acquireAndInsertTerms

https://gerrit.wikimedia.org/r/1050658

Given that this table is secondary data with the ability to self heal after reparse

Minor correction: AFAICT we only update the term store on actual edits, not on e.g. ?action=purge or a “null edit” (I tested action=wbeditentity with data={}).

Meanwhile, the specific query vanished from the slow queries dashboard (https://logstash.wikimedia.org/goto/e6310eb39c862c3180e927bf3109b01a):

image.png (240×960 px, 20 KB)

Though the effect is harder to see when not filtering for that particular query (https://logstash.wikimedia.org/goto/6b3cb20df91e9deb963975e1ecfda535):
image.png (240×960 px, 20 KB)

Change #1054359 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] DatabaseTermStoreWriterBase: Clarify best effort basis

https://gerrit.wikimedia.org/r/1054359

Though the effect is harder to see when not filtering for that particular query (https://logstash.wikimedia.org/goto/6b3cb20df91e9deb963975e1ecfda535):

image.png (240×960 px, 20 KB)

Eh, now that the big spike at the beginning of that chart is more than 90 days ago, I think the decrease in slow queries since mid-July is actually relatively visible:

image.png (240×960 px, 20 KB)

IMHO that’s good enough to close this task (unless @Ladsgroup wants to reopen it).

If you measure only the "write" times (separating issues on the master from random replicas being slow), the change is much more visible: https://logstash.wikimedia.org/goto/0474b926f6f8cf84846ec87b38335f53

grafik.png (518×1 px, 79 KB)