Not able to mark big pages for translation at mediawiki.org
Closed, Resolved | Public | 1 Estimated Story Points

Description

I am not able to mark the VisualEditor user guide for translation after updating some parts: https://www.mediawiki.org/wiki/Help:VisualEditor/User_guide .

I get this error:

Database error

A database query error has occurred. This may indicate a bug in the software.

Function: MessageGroupStats::clearGroup
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.27)

Originally reported at https://www.mediawiki.org/wiki/Thread:Extension_talk:Translate/Failed_to_mark_for_translation

Event Timeline

Amire80 raised the priority of this task from to Medium.
Amire80 updated the task description.
Amire80 added subscribers: Amire80, Nikerabbit.
Amire80 raised the priority of this task from Medium to Unbreak Now!. (Feb 25 2015, 6:54 PM)
Amire80 added a project: LE-Sprint-83.
Amire80 set Security to None.

There seems to be a general issue of too many and/or inefficient write queries against the message group stats table.

Some work is on-going here on Aaron's initiative:

But there is something else going on as well; let me explain. The act of (re-)marking a page for translation roughly goes as follows:

  1. Make edit to the source page to insert section ids
  2. Delete and insert sections to translate_sections table
  3. Update value in translate_metadata table
  4. Update value in revtag table
  5. Clear message group cache (which is essentially serialized MessageGroup objects stored in a BagOStuff cache)
  6. Create jobs to update all assorted pages and push them to jobqueue
  7. Add entry to logging table
  8. Attempt to invalidate squid etc caches for the source page
  9. Rebuild message index

My expectation is that MessageGroupStats::clearGroup() would be called during the last step above, message index rebuilding. See this code. At this point all the really necessary actions have been done, so the page should have been marked successfully even with the timeout here. But according to Amir the page does not get marked. I can confirm this, because there has been no log entry since January.

I can think of two possibilities:

  1. MessageGroupStats::clearGroup gets called earlier in the process. A backtrace would help to see if this is the case.
  2. Some of the previous actions have either not been committed to the database, or were rolled back.

Grep finds three places where MessageGroupStats::clearGroup() is called:

  1. SpecialPageTranslation: it should only be called when action is discourage or encourage, not with mark
  2. SpecialMessageGroupStats: irrelevant
  3. MessageIndex: the one I pointed out above

There is one other code path that could trigger MessageIndexRebuild and thus MessageGroupStats::clearGroup, but I have never seen that at translatewiki.net, so I don't really believe in that either.

@aaron, if you or anyone else in platform has any ideas about what is going on, I would be very happy to hear them, since I don't really know what is happening here.

Nemo_bis renamed this task from "not able to mark a page for translation at mediawiki.org" to "Not able to mark big pages for translation at mediawiki.org". (Feb 26 2015, 11:55 AM)
Nemo_bis updated the task description.
Nemo_bis removed a project: VisualEditor.
Nemo_bis subscribed.

According to Aaron the database exception caused by the lock wait timeout can bubble up and indeed cause a rollback.
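The rollback behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses Python's sqlite3 rather than MediaWiki's database layer, and `LockWaitTimeout`, `rebuild_message_index`, `mark_for_translation` and the simplified table columns are hypothetical stand-ins, not the Translate extension's code. It shows that when the last step raises inside the same transaction, the earlier writes (including the log entry) are rolled back too.

```python
import sqlite3


class LockWaitTimeout(Exception):
    """Hypothetical stand-in for MySQL error 1205 (lock wait timeout)."""


def rebuild_message_index(conn):
    # Simulates the final step failing on a contended stats table.
    raise LockWaitTimeout("1205 Lock wait timeout exceeded")


def mark_for_translation(conn, page, fail_last_step):
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "INSERT INTO translate_sections (page, section) VALUES (?, ?)",
            (page, "s1"),
        )
        conn.execute(
            "INSERT INTO logging (page, action) VALUES (?, 'mark')", (page,)
        )
        if fail_last_step:
            rebuild_message_index(conn)  # the exception bubbles up from here


def count_log_entries(conn):
    return conn.execute("SELECT COUNT(*) FROM logging").fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE translate_sections (page TEXT, section TEXT)")
    conn.execute("CREATE TABLE logging (page TEXT, action TEXT)")
    conn.commit()

    try:
        mark_for_translation(conn, "Help:Example", fail_last_step=True)
    except LockWaitTimeout:
        pass  # the user would see the "Database error" page here
    after_failure = count_log_entries(conn)

    mark_for_translation(conn, "Help:Example", fail_last_step=False)
    after_success = count_log_entries(conn)
    return after_failure, after_success
```

In this simulation the failed attempt leaves no row in the logging table, which matches the observation that no log entry exists for the failed marking attempts.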

Instead of trying to handle this case specifically, which can be complicated, it seems we should reduce the contention on that table. To do that I need more info on what is causing the most contention there.

Arrbee moved this task from Backlog to Blocked on the LE-Sprint-84 board.

May not be relevant any longer, but this recalled T72153 to mind. We'll have to poll the master innodb status to catch the current problem.

I still see some bursty lock wait timeouts from the runners. Looking at the code again, the MessageUpdateJob job lock problems aren't even the issue here, since that happens in some totally separate transaction in a runner (and gets retried). I don't see why that would break the stuff in the main transaction, such as the log table updates. For March 1-16 I did see 5 DB errors in markForTranslation(). Most were in MessageIndexRebuildJob::newJob()->run() (the last step of the special page transaction) and were lock wait timeouts. I also saw one in doEditContent() that was a deadlock in updateRevisionOn(). I didn't see any exceptions from the isValid() path.

Why isn't the MessageIndexRebuildJob just enqueued instead of being run on the spot? That would stop DB errors from bubbling up. I suppose you could use try/catch or onTransactionIdle(), though that is really slow with lock wait timeouts.

If we change MessageIndexRebuildJob::newJob()->run(); to MessageIndexRebuildJob::newJob()->insert(); it will follow the setting $wgTranslateDelayedMessageIndexRebuild which I expect to be true on WMF. The downside is that people complain loudly if the job is not executed swiftly, for example T91166. Can we configure that job to be high priority?
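The difference between running the job on the spot and enqueueing it can be sketched as follows. This is a simplified Python model, not the Translate extension's code: `RebuildJob`, `JobQueue`, `mark_page` and the `delayed` flag (standing in for $wgTranslateDelayedMessageIndexRebuild) are hypothetical names chosen for illustration. With immediate execution the failure propagates to the caller; with enqueueing, the caller finishes and the failure stays in the runner.

```python
from collections import deque


class LockWaitTimeout(Exception):
    """Hypothetical stand-in for MySQL error 1205 (lock wait timeout)."""


class RebuildJob:
    def __init__(self, fail):
        self.fail = fail

    def run(self):
        if self.fail:
            raise LockWaitTimeout("1205 Lock wait timeout exceeded")
        return "rebuilt"


class JobQueue:
    def __init__(self):
        self.pending = deque()
        self.failed = []

    def push(self, job):
        self.pending.append(job)

    def run_all(self):
        # A runner executes jobs later, outside the web request.
        done = 0
        while self.pending:
            job = self.pending.popleft()
            try:
                job.run()
                done += 1
            except LockWaitTimeout:
                self.failed.append(job)  # a real runner could retry here
        return done


def mark_page(job, queue, delayed):
    # "delayed" models the configuration setting (an assumption of this sketch).
    if delayed:
        queue.push(job)  # errors can no longer reach the caller
    else:
        job.run()  # errors bubble up into the caller
    return "marked"
```

The trade-off mentioned in the discussion is visible here as well: with `delayed=True` the caller always gets "marked", but the rebuild only happens once something calls `run_all()`, so a slow runner delays the visible result.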

I suppose it could have a dedicated loop in the jobrunner JSON config.

Change 197918 had a related patch set uploaded (by Nikerabbit):
Special:PageTranslation: use JobQueue for message index rebuild

https://gerrit.wikimedia.org/r/197918

Change 197919 had a related patch set uploaded (by Nikerabbit):
Add dedicated runner for MessageIndexRebuildJob

https://gerrit.wikimedia.org/r/197919

Change 197920 had a related patch set uploaded (by Nikerabbit):
$wgTranslateDelayedMessageIndexRebuild = true;

https://gerrit.wikimedia.org/r/197920

Change 197918 merged by jenkins-bot:
Special:PageTranslation: use JobQueue for message index rebuild

https://gerrit.wikimedia.org/r/197918

Change 197919 merged by Filippo Giunchedi:
Add dedicated runner for MessageIndexRebuildJob

https://gerrit.wikimedia.org/r/197919

Nikerabbit lowered the priority of this task from Unbreak Now! to Medium. (Mar 31 2015, 10:47 AM)

I have added the configuration patch for SWAT today. It looks like someone has managed to mark the referenced page for translation in the meantime, hence lowering the priority. The work done for this is, nevertheless, useful and should speed up various actions related to translatable pages.

Change 197920 merged by jenkins-bot:
$wgTranslateDelayedMessageIndexRebuild = true;

https://gerrit.wikimedia.org/r/197920

This may be related to the recent innodb_lock_wait_timeout change:

https://gerrit.wikimedia.org/r/#/c/206442/

It was 50 seconds and is now 15 seconds. This is probably OK for wikiuser, but we need to establish whether it should apply to wikiadmin.

Perhaps this should be a separate bug (unless marking pages for translation fails again, which I haven't seen reports of).

How can we get more info on what is going on? I can only guess that we are still getting too many concurrent updates via page views of Special:(Language|MessageGroups)Stats. Or it could be the jobs that update workflow states.

@Nikerabbit, we suspected that it is the same root cause. If that's not the case, then please reopen T98427 & close this task.

"unless marking pages for translation fails again, which I haven't seen reports of"

How can we verify that? With the job queue, the user doesn't get immediate feedback upon errors. The easiest thing to check is FuzzyBot edits, but these are separately broken by T48716. Adding a new page for translation and seeing whether the message group was created might be the way, but it needs to be a big page to trigger the error.

Nemo_bis claimed this task.