Page MenuHomePhabricator

DBQueryError when discouraging translation
Closed, ResolvedPublic

Description

Tried to discourage translation of https://meta.wikimedia.org/wiki/Abuse_filter_editors

[XWcXowpAICsAAL5HABMAAACQ] 2019-08-29 00:09:24: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"

Tried again, and it worked

Event Timeline

Restricted Application added a project: User-DannyS712. · View Herald TranscriptAug 29 2019, 12:11 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Query: REPLACE INTO `translate_groupreviews` (tgr_group,tgr_lang,tgr_state) VALUES ('page-Abuse filter editors','*priority','discouraged')
Function: MessageGroups::setPriority
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.48.15)

At the same time:

Sub-optimal transaction on DB(s) [10.64.48.15 (metawiki) (TRX#5b688c)]: 
0	0.000550	query-m: REPLACE INTO `translate_groupreviews` (tgr_group,tgr_lang,tgr_state) VALUES ('X') [TRX#5b688c]
1	0.000658	query-m: INSERT INTO `logging` (log_type,log_action,log_timestamp,log_namespace,log_title,log_page,log_params,log_comment_id,log_actor) VALUES ('X') [TRX#5b688c]
2	0.324470	query-m: SELECT page_id,page_title,page_namespace,page_latest,MAX(rt_revision) AS rt_revision,rt_type FROM `page`,`revtag` WHERE (page_id=rt_page) AND rt_type IN ('X') GROUP BY page_id, rt_type ORDER BY page_namespace, page_title  [TRX#5b688c]
3	61.251727	...delay...

@Nikerabbit what is the IP address in your comment above? If that is mine, please redact it

It's the IP of the database server.

DannyS712 moved this task from Unsorted to Others on the User-DannyS712 board.Aug 29 2019, 8:09 AM

The code is roughly:

MessageGroups::getPriority( $id );

MessageGroups::setPriority( $id, $new ); // First query
new ManualLogEntry( 'pagetranslation', $action )->publish(); // Second query
$this->listPages() // Third query
MessageGroupStats::clearGroup( $shared );

$this->listPages() shouldn't take much time in any condition. Hence I believe it is the stats clearing that keeps the transaction open. My plan is to defer it to the job queue. It won't explain what was holding locks on the table, but it should at least avoid affecting web requests. Clearing the stats doesn't need to happen right away, and it isn't such a big deal if it fails. The stats usually self-heal over time due to other changes.

Change 533234 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Defer message group stats updating when (dis|en)couraging pages

https://gerrit.wikimedia.org/r/533234

Nikerabbit triaged this task as Medium priority.Aug 29 2019, 2:27 PM
Nikerabbit added a subscriber: abi_.

@abi_ This seems quite similar to the issue with the message index, just a different table.

Change 533234 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Defer message group stats updating when (dis|en)couraging pages

https://gerrit.wikimedia.org/r/533234

Last instance of this "Lock wait timeout exceeded" was at 2019-08-29T00:09:24. Recently there has only been one sub-optimal transaction:

Sub-optimal transaction on DB(s) [10.64.48.15 (metawiki) (TRX#57aa48)]: 
0	3.636585	query-m: REPLACE INTO `translate_groupreviews` (tgr_group,tgr_lang,tgr_state) VALUES ('X') [TRX#57aa48]
1	0.000441	query-m: INSERT INTO `logging` (log_type,log_action,log_timestamp,log_namespace,log_title,log_page,log_params,log_comment_id,log_actor) VALUES ('X') [TRX#57aa48]
2	2.121573	...delay...

The events are very sparse though, so it's hard to say whether this is fully fixed. Below plot of the error over time:

Jpita added a subscriber: Jpita.Sep 19 2019, 10:38 AM

@Nikerabbit from your previous comment everything seems to be ok, can I move this to Done?

I'm leaving it open as a reminder to do one final check before sprint close.

Nikerabbit closed this task as Resolved.Sep 23 2019, 8:44 AM

Seems okay for now. Maybe those updates could be wrapper in their own transaction to avoid keeping it open longer than necessary. But it might not help if there is lock contention (I see no reason why simple replace into would take seconds).