DBQueryError when discouraging translation
Closed, ResolvedPublic
Actions

Description

Tried to discourage translation of https://meta.wikimedia.org/wiki/Abuse_filter_editors

[XWcXowpAICsAAL5HABMAAACQ] 2019-08-29 00:09:24: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"

Tried again, and it worked

Details

	Subject	Repo	Branch	Lines +/-
	Defer message group stats updating when (dis\|en)couraging pages	mediawiki/extensions/Translate	master	+42 -8

Customize query in gerrit

Event Timeline

DannyS712 created this task.Aug 29 2019, 12:11 AM

Restricted Application added a project: User-DannyS712. · View Herald TranscriptAug 29 2019, 12:11 AM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Query: REPLACE INTO `translate_groupreviews` (tgr_group,tgr_lang,tgr_state) VALUES ('page-Abuse filter editors','*priority','discouraged')
Function: MessageGroups::setPriority
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.48.15)

At the same time:

Sub-optimal transaction on DB(s) [10.64.48.15 (metawiki) (TRX#5b688c)]: 
0	0.000550	query-m: REPLACE INTO `translate_groupreviews` (tgr_group,tgr_lang,tgr_state) VALUES ('X') [TRX#5b688c]
1	0.000658	query-m: INSERT INTO `logging` (log_type,log_action,log_timestamp,log_namespace,log_title,log_page,log_params,log_comment_id,log_actor) VALUES ('X') [TRX#5b688c]
2	0.324470	query-m: SELECT page_id,page_title,page_namespace,page_latest,MAX(rt_revision) AS rt_revision,rt_type FROM `page`,`revtag` WHERE (page_id=rt_page) AND rt_type IN ('X') GROUP BY page_id, rt_type ORDER BY page_namespace, page_title  [TRX#5b688c]
3	61.251727	...delay...

@Nikerabbit what is the IP address in your comment above? If that is mine, please redact it

It's the IP of the database server.

DannyS712 moved this task from Unsorted to Others on the User-DannyS712 board.Aug 29 2019, 8:09 AM

The code is roughly:

MessageGroups::getPriority( $id );

MessageGroups::setPriority( $id, $new ); // First query
new ManualLogEntry( 'pagetranslation', $action )->publish(); // Second query
$this->listPages() // Third query
MessageGroupStats::clearGroup( $shared );

$this->listPages() shouldn't take much time in any condition. Hence I believe it is the stats clearing that keeps the transaction open. My plan is to defer it to the job queue. It won't explain what was holding locks on the table, but it should at least avoid affecting web requests. Clearing the stats doesn't need to happen right away, and it isn't such a big deal if it fails. The stats usually self-heal over time due to other changes.

Change 533234 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Defer message group stats updating when (dis|en)couraging pages

https://gerrit.wikimedia.org/r/533234

gerritbot added a project: Patch-For-Review.Aug 29 2019, 2:27 PM

@abi_ This seems quite similar to the issue with the message index, just a different table.

Nikerabbit added a project: Language-Team (Language-2019-July-September).Aug 29 2019, 2:29 PM

Change 533234 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Defer message group stats updating when (dis|en)couraging pages

https://gerrit.wikimedia.org/r/533234

ReleaseTaggerBot added a project: MW-1.34-notes (1.34.0-wmf.22; 2019-09-10).Sep 5 2019, 8:00 AM

Maintenance_bot removed a project: Patch-For-Review.Sep 5 2019, 8:10 AM

Krinkle moved this task from Untriaged to Aug2019/1.34.wmf.17+ on the Wikimedia-production-error board.Sep 17 2019, 4:09 PM

Nikerabbit claimed this task.Sep 18 2019, 6:50 AM

Nikerabbit moved this task from Backlog to Recheck after deployment on the Language-Team (Language-2019-July-September) board.

Last instance of this "Lock wait timeout exceeded" was at 2019-08-29T00:09:24. Recently there has only been one sub-optimal transaction:

Sub-optimal transaction on DB(s) [10.64.48.15 (metawiki) (TRX#57aa48)]: 
0	3.636585	query-m: REPLACE INTO `translate_groupreviews` (tgr_group,tgr_lang,tgr_state) VALUES ('X') [TRX#57aa48]
1	0.000441	query-m: INSERT INTO `logging` (log_type,log_action,log_timestamp,log_namespace,log_title,log_page,log_params,log_comment_id,log_actor) VALUES ('X') [TRX#57aa48]
2	2.121573	...delay...

The events are very sparse though, so it's hard to say whether this is fully fixed. Below plot of the error over time:

@Nikerabbit from your previous comment everything seems to be ok, can I move this to Done?

I'm leaving it open as a reminder to do one final check before sprint close.

Seems okay for now. Maybe those updates could be wrapper in their own transaction to avoid keeping it open longer than necessary. But it might not help if there is lock contention (I see no reason why simple replace into would take seconds).

DannyS712 moved this task from Others to Resolved tasks (others) on the User-DannyS712 board.Oct 10 2019, 11:03 AM

	F30387932: image.png
	Sep 18 2019, 7:01 AM

DBQueryError when discouraging translationClosed, ResolvedPublicActions

Description

Details

Event Timeline

DBQueryError when discouraging translation
Closed, ResolvedPublic
Actions