Not able to mark big pages for translation at mediawiki.org
Closed, Resolved · Public · 1 Estimated Story Points

Authored By

	Amire80
	Feb 25 2015, 11:55 AM

Description

I am not able to mark the VisualEditor user guide for translation after updating some parts: https://www.mediawiki.org/wiki/Help:VisualEditor/User_guide .

I get this error:

Database error

A database query error has occurred. This may indicate a bug in the software.

Function: MessageGroupStats::clearGroup
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.27)

Originally reported at https://www.mediawiki.org/wiki/Thread:Extension_talk:Translate/Failed_to_mark_for_translation

Details

Subject	Repo	Branch	Lines +/-
$wgTranslateDelayedMessageIndexRebuild = true;	operations/mediawiki-config	master	+1 -0
Add dedicated runner for MessageIndexRebuildJob	operations/puppet	production	+12 -2
Special:PageTranslation: use JobQueue for message index rebuild	mediawiki/extensions/Translate	master	+1 -1

Related Objects

Mentioned In: T98427: "Lock wait timeout exceeded" errors in logstash, aparently related to related to MessageUpdateJob or MessageGroupStats::clear
rOMWC918601ab7811: $wgTranslateDelayedMessageIndexRebuild = true;
rOPUPf69a9b67618a: Add dedicated runner for MessageIndexRebuildJob
rMEXTab6dcdcdcc06: Updated mediawiki/extensions Project: mediawiki/extensions/Translate…
rETRAffc5d4eecbfa: Special:PageTranslation: use JobQueue for message index rebuild
Mentioned Here: T48716: Translation page does not contain the latest translations/last translation
T98427: "Lock wait timeout exceeded" errors in logstash, aparently related to related to MessageUpdateJob or MessageGroupStats::clear
T91166: Some translations is not saved in Meta-Wiki: Unknown error: "tpt-unknown-page"
T72153: MessageGroupStats co-operative deadlock with transactions and GET_LOCK

Event Timeline

Amire80 created this task. Feb 25 2015, 11:55 AM

Amire80 raised the priority of this task from to Medium.

Amire80 updated the task description.

Amire80 added projects: VisualEditor, I18n, MediaWiki-extensions-Translate.

Amire80 added subscribers: Amire80, • Nikerabbit.

Restricted Application added a subscriber: Aklapper. Feb 25 2015, 11:55 AM

Amire80 raised the priority of this task from Medium to Unbreak Now!. Feb 25 2015, 6:54 PM

Amire80 added a project: LE-Sprint-83.

Amire80 set Security to None.

Jdforrester-WMF moved this task from To Triage to Freezer on the VisualEditor board. Feb 25 2015, 7:09 PM

There seems to be a general issue of too many and/or inefficient write queries against the message group stats table.

Some work is on-going here on Aaron's initiative:

But there is something else going on as well; let me explain. The act of (re-)marking a page for translation roughly goes as follows:

1. Make an edit to the source page to insert section ids
2. Delete and insert sections in the translate_sections table
3. Update the value in the translate_metadata table
4. Update the value in the revtag table
5. Clear the message group cache (which is essentially serialized MessageGroup objects stored in a BagOStuff cache)
6. Create jobs to update all associated pages and push them to the job queue
7. Add an entry to the logging table
8. Attempt to invalidate Squid and other caches for the source page
9. Rebuild the message index

My expectation is that MessageGroupStats::clearGroup() would be called during the last step above, message index rebuilding. See this code. At this point all the really necessary actions have been done, so the page should have been marked successfully even with the timeout here. But according to Amir the page does not get marked. I can confirm this, because there has been no log entry since January.

I can think of two possibilities:

MessageGroupStats::clearGroup gets called earlier in the process. A backtrace would help to see if this is the case.
Some of the previous actions have either not been committed to the database, or were rolled back.

Grep finds three places where MessageGroupStats::clearGroup() is called:

SpecialPageTranslation: it should only be called when action is discourage or encourage, not with mark
SpecialMessageGroupStats: irrelevant
MessageIndex: the one I pointed out above

There is one other code path that could trigger MessageIndexRebuild and thus MessageGroupStats::clearGroup, but I have never seen that at translatewiki.net, so I don't really believe in that either.

@aaron, if you or anyone else in platform has any idea what is going on, I would be very happy to hear it, since I don't really know what is happening here.

• Nikerabbit claimed this task. Feb 26 2015, 10:13 AM

• Elitre subscribed. Feb 26 2015, 10:56 AM

Nemo_bis renamed this task from "not able to mark a page for translation at mediawiki.org" to "Not able to mark big pages for translation at mediawiki.org". Feb 26 2015, 11:55 AM

Nemo_bis updated the task description.

Nemo_bis removed a project: VisualEditor.

Nemo_bis subscribed.

• Nikerabbit edited a custom field. Feb 26 2015, 2:00 PM

• Nikerabbit moved this task from Backlog to Blocked on the LE-Sprint-83 board.

Shirayuki subscribed. Feb 26 2015, 9:49 PM

bd808 subscribed. Mar 2 2015, 4:54 PM

According to Aaron the database exception caused by the lock wait timeout can bubble up and indeed cause a rollback.

Instead of trying to handle this case specifically, which can be complicated, it seems we should reduce the contention on that table. To do that I need more info on what is causing the most contention there.
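The rollback behaviour described above can be illustrated with a minimal sketch. It uses Python's sqlite3 module as a stand-in for the production database; the table layout, the `LockWaitTimeout` exception and the function names are hypothetical and are not taken from the Translate extension. It only shows why an error in the final step can undo the earlier writes, including the log entry, when everything shares one transaction.

```python
import sqlite3


class LockWaitTimeout(Exception):
    """Hypothetical stand-in for MySQL error 1205."""


def rebuild_message_index(conn):
    # Simulates the last step failing on a contended stats table.
    raise LockWaitTimeout("Lock wait timeout exceeded; try restarting transaction")


def mark_for_translation(conn, page):
    """Run all steps in one transaction; roll everything back on failure."""
    try:
        conn.execute(
            "INSERT INTO translate_sections (page, section) VALUES (?, ?)",
            (page, "s1"),
        )
        conn.execute(
            "INSERT INTO logging (page, action) VALUES (?, ?)", (page, "mark")
        )
        rebuild_message_index(conn)  # final step, raises here
        conn.commit()
        return True
    except LockWaitTimeout:
        # The exception bubbles up to the transaction owner, which rolls back
        # every earlier write made in the same transaction.
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translate_sections (page TEXT, section TEXT)")
conn.execute("CREATE TABLE logging (page TEXT, action TEXT)")
conn.commit()

marked = mark_for_translation(conn, "Help:VisualEditor/User_guide")
log_rows = conn.execute("SELECT COUNT(*) FROM logging").fetchone()[0]
```

In this sketch the logging table stays empty after the failed attempt, which matches the symptom of no log entry being written.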

Nemo_bis added a subscriber: • Springle. Mar 5 2015, 3:06 PM

Arrbee edited projects, added LE-Sprint-84; removed LE-Sprint-83. Mar 11 2015, 6:11 AM

Arrbee moved this task from Backlog to Blocked on the LE-Sprint-84 board.

May not be relevant any longer, but this recalled T72153 to mind. We'll have to poll the master innodb status to catch the current problem.

I still see some bursty lock wait timeouts from the runners. Looking at the code again, the MessageUpdateJob lock problems aren't even the issue here, since that happens in a totally separate transaction in a runner (and gets retried). I don't see why that would break the work done in the main transaction, such as the log table updates. For March 1-16 I did see 5 DB errors in markForTranslation(). Most were in MessageIndexRebuildJob::newJob()->run() (the last step of the special page transaction) and were lock wait timeouts. I also saw one in doEditContent() that was a deadlock in updateRevisionOn(). I didn't see any exceptions from the isValid() path.
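As a rough illustration of why a retried job in a runner is less harmful than a failure inside the web request, here is a minimal sketch of a retry loop. The `LockWaitTimeout` exception, `run_with_retries` and the attempt limit are hypothetical; the real job runner's retry policy is not described in this task.

```python
class LockWaitTimeout(Exception):
    """Hypothetical stand-in for MySQL error 1205."""


def run_with_retries(job, max_attempts=3):
    """Run job() and retry on lock wait timeouts; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return attempt
        except LockWaitTimeout:
            if attempt == max_attempts:
                raise  # give up after the final attempt


calls = {"n": 0}


def flaky_job():
    # Fails twice under simulated contention, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise LockWaitTimeout("Lock wait timeout exceeded")


attempts_used = run_with_retries(flaky_job)
```

Because each attempt runs in its own transaction in the runner, a failed attempt here would not touch the writes made by the special page request.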

Why isn't the MessageIndexRebuildJob just enqueued instead of being run on the spot? That would stop DB errors from bubbling up. I suppose you could use try/catch or onTransactionIdle(), though that is really slow with lock wait timeouts.

If we change MessageIndexRebuildJob::newJob()->run(); to MessageIndexRebuildJob::newJob()->insert(); it will follow the setting $wgTranslateDelayedMessageIndexRebuild which I expect to be true on WMF. The downside is that people complain loudly if the job is not executed swiftly, for example T91166. Can we configure that job to be high priority?
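The difference between running the job on the spot and inserting it can be sketched as follows. This is a simplified Python model, not the extension's PHP code: the class, the `delayed` flag (standing in for $wgTranslateDelayedMessageIndexRebuild) and the in-memory queue are assumptions made for illustration.

```python
from collections import deque


class MessageIndexRebuildJob:
    """Simplified model of a job that can run inline or be queued."""

    def __init__(self, queue, delayed, rebuild):
        self.queue = queue      # hypothetical in-memory job queue
        self.delayed = delayed  # models $wgTranslateDelayedMessageIndexRebuild
        self.rebuild = rebuild  # callable doing the actual rebuild

    def run(self):
        # Runs inside the caller's request; errors propagate to the caller.
        self.rebuild()

    def insert(self):
        # Follows the setting: queue the job if delayed, else run it now.
        if self.delayed:
            self.queue.append(self)
        else:
            self.run()


def failing_rebuild():
    raise RuntimeError("1205 Lock wait timeout exceeded")


queue = deque()
MessageIndexRebuildJob(queue, delayed=True, rebuild=failing_rebuild).insert()
queued = len(queue)  # the request finished; the job waits for a runner

try:
    MessageIndexRebuildJob(deque(), delayed=False, rebuild=failing_rebuild).insert()
    raised = False
except RuntimeError:
    raised = True  # the inline run surfaces the error in the request
```

The trade-off mentioned above still applies: a queued job only helps if a runner picks it up quickly.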

KartikMistry added subscribers: KartikMistry, Jdforrester-WMF. Mar 18 2015, 5:45 PM

KartikMistry removed a subscriber: Jdforrester-WMF.

I suppose it could have a dedicated loop in the jobrunner JSON config.

Change 197918 had a related patch set uploaded (by Nikerabbit):
Special:PageTranslation: use JobQueue for message index rebuild

https://gerrit.wikimedia.org/r/197918

Change 197919 had a related patch set uploaded (by Nikerabbit):
Add dedicated runner for MessageIndexRebuildJob

https://gerrit.wikimedia.org/r/197919

Change 197920 had a related patch set uploaded (by Nikerabbit):
$wgTranslateDelayedMessageIndexRebuild = true;

https://gerrit.wikimedia.org/r/197920

• Nikerabbit added a project: Blocked-on-MediaWiki-Core. Mar 23 2015, 12:41 PM

Change 197918 merged by jenkins-bot:
Special:PageTranslation: use JobQueue for message index rebuild

https://gerrit.wikimedia.org/r/197918

• Nikerabbit mentioned this in rETRAffc5d4eecbfa: Special:PageTranslation: use JobQueue for message index rebuild. Mar 24 2015, 5:36 PM

Diffusion mentioned this in rMEXTab6dcdcdcc06: Updated mediawiki/extensions Project: mediawiki/extensions/Translate…. Mar 24 2015, 5:37 PM

• Nikerabbit moved this task from Backlog to page translation on the MediaWiki-extensions-Translate board. Mar 27 2015, 5:56 PM

Arrbee edited projects, added LE-Sprint-85; removed LE-Sprint-84. Mar 30 2015, 10:30 AM

Change 197919 merged by Filippo Giunchedi:
Add dedicated runner for MessageIndexRebuildJob

https://gerrit.wikimedia.org/r/197919

fgiunchedi mentioned this in rOPUPf69a9b67618a: Add dedicated runner for MessageIndexRebuildJob. Mar 30 2015, 3:19 PM

I have added the configuration patch for SWAT today. It looks like someone has managed to mark the referenced page for translation in the meantime, so I am lowering the priority. The work done for this is nevertheless useful and should speed up various actions related to translatable pages.

Change 197920 merged by jenkins-bot:
$wgTranslateDelayedMessageIndexRebuild = true;

https://gerrit.wikimedia.org/r/197920

MarkTraceur mentioned this in rOMWC918601ab7811: $wgTranslateDelayedMessageIndexRebuild = true;. Mar 31 2015, 3:26 PM

Nemo_bis closed this task as Resolved. Mar 31 2015, 3:49 PM

• Nikerabbit removed projects: Blocked-on-MediaWiki-Core, Patch-For-Review. Mar 31 2015, 5:42 PM

Pginer-WMF moved this task from Backlog to Done on the LE-Sprint-85 board. Apr 1 2015, 8:23 AM

BBlack mentioned this in T98427: "Lock wait timeout exceeded" errors in logstash, aparently related to related to MessageUpdateJob or MessageGroupStats::clear. May 7 2015, 1:14 AM

Re-opening, as it still seems to be happening (see T98427).

This may be related to the recent innodb_lock_wait_timeout change:

https://gerrit.wikimedia.org/r/#/c/206442/

It was 50 seconds and is now 15 seconds. This is probably OK for wikiuser, but we need to establish whether it should apply to wikiadmin.

Perhaps this should be a separate bug (unless marking pages for translation fails again, which I haven't seen reports of).

How can we get more info on what is going on? I can only guess that we are still getting too many concurrent updates via page views of Special:(Language|MessageGroups)Stats. Or it could be the jobs that update workflow states.

@Nikerabbit, we suspected that it is the same root cause. If that's not the case, then please reopen T98427 & close this task.

"unless marking pages for translation fails again, which I haven't seen reports of"

How can this be verified? With the job queue, the user doesn't get immediate feedback on errors. The easiest thing to check is FuzzyBot edits, but these are separately broken by T48716. Adding a new page for translation and seeing whether the message group was created might be the way, but it needs to be a big page to trigger the error.

• Nikerabbit removed • Nikerabbit as the assignee of this task. May 24 2015, 11:16 AM

• Nikerabbit removed a project: LE-Sprint-85.

Glaisher subscribed. May 25 2015, 4:05 PM

Nemo_bis closed this task as Resolved. May 31 2015, 5:52 PM

Nemo_bis claimed this task.
