Page MenuHomePhabricator

Translation page does not contain the latest translations/last translation
Closed, ResolvedPublic

Description

Current status

We are not currently seeing indications in the logs that pages are getting out of sync due to replication lag.

So far we have:

  • Added statement to wait for replication (up until a timeout), which seemingly did not fix the problem as we are reaching the timeout
  • Fixed one case of potentially using replica instead of a master
  • Identified that there are inefficient transactions that may hinder the wait for replication logic working as intended
  • Switches post-translation updates (TranslateRenderJob) to use JobQueue
  • Updated all translation pages using refresh-translatable-pages.php script.

Original report

For eg: Saving https://meta.wikimedia.org/w/index.php?title=Steward_requests/Global_permissions/GS-Header&diff=20168333&oldid=19016463 did not updated the result on https://meta.wikimedia.org/wiki/Special:MyLanguage/Steward_requests/Global_permissions so I have to do a dummy edit https://meta.wikimedia.org/w/index.php?title=Steward_requests/Global_permissions/GS-Header&diff=20168363&oldid=20168333 to make it work.

Previous reports about similar issues:

QA plan

Affected projects: Wikimedia projects with Translate
QA person: @Nikerabbit / @abi_
QA task post deployment:

  • Test (re-)marking page for translations, make changes to the template and see if those are applied after re-marking without need of dummy edits and re-marking.
  • Monitor logs for "Continuing despite replication lag"

Outcome

All translation pages are now show all the latest changes. Translation pages should no longer get out of sync because of replication lag.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I would like to get more feedback to help investigate the task. So far I see a couple of reports where a template update did not show up in a page where the template is tranclused. This is something that is likely out of our control, as MediaWiki core handles dependency tracking and refreshing of pages due to dependency changes.

The example of Nemo_bis is different, however, because even though it is a template page, it clearly shows that a change to the template was not updated to the template itself until a dummy edit happened. This is clearly an issue with our code.

So please, do report more examples of this issue, and do provide the following info:

  • Is it related to updating translations, or when marking a page for translation
  • Link to the original edit
  • Link that shows when the page was marked for translation (if applicable)
  • Link to the page where you did not see the edit, and timestamp when you checked¹

¹ We rely on the job queue for many operations, so for example updating of translation pages is not instant. Do wait at least a few minutes to see if it is just a normal delay.

Looking at example Nemo_bis linked:

WhenWhat
Andre made an edit to Template:OPW2020-05-31T19:19:00‎Z
Shirayuki re-marked the page for translation2020-05-31T22:09:25Z
Nemo_bis made a dummy edit2020-06-12T08:21:24‎Z
Nemo_bis re-marked the page for ftranslation2020-06-12T08:21:32Z
FuzzyBot updates Template:OPW/en¹2020-06-12T08:21:32‎Z

¹It includes changes Andre made, but not the one Nemo_bis made.

Here are logs from around the time Shirayuki re-marked the page for translation:

NOTE: Since Logstash only has one second precisions for events, the ordering of messages may not be accurate.
2020-05-31T22:09:25mw1311INFO[Job: TranslationsUpdateJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW] Starting TranslationsUpdateJob
2020-05-31T22:09:26mw1311INFO[Job: TranslationsUpdateJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW] Finished running 3 MessageUpdate jobs for 3 sections
2020-05-31T22:09:26mw1311INFO[Job: TranslationsUpdateJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW] Cleared caches
2020-05-31T22:09:26mw1311INFO[Job: TranslationsUpdateJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW] Updated the message group stats
2020-05-31T22:09:26mw1311INFO[Job: TranslationsUpdateJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW] Finished purging
2020-05-31T22:09:26mw1311INFO[Job: TranslationsUpdateJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW] Added 3 RenderJobs to the queue
2020-05-31T22:09:26mw1334INFO[Job: TranslateRenderJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW/ja] Starting TranslateRenderJob
2020-05-31T22:09:26mw1302INFO[Job: TranslateRenderJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW/en] Starting TranslateRenderJob
2020-05-31T22:09:26mw1311INFO[Job: TranslationsUpdateJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW] Finished TranslationsUpdateJob
2020-05-31T22:09:26mw1310INFO[MessageIndex] Started rebuild. Initiated by MediaWiki\Extension\EventBus\JobExecutor->execute/MessageIndexRebuildJob->run/MessageIndex->rebuild
2020-05-31T22:09:26mw1310INFO[MessageIndex] Got lock in 0.00070500373840332
2020-05-31T22:09:27mw1334INFO[Job: TranslateRenderJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW/ja] Finished page edit operation
2020-05-31T22:09:27mw1334INFO[Job: TranslateRenderJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW/ja] Finished TranslateRenderJob
2020-05-31T22:09:27mw1302INFO[Job: TranslateRenderJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW/en] Finished TranslateRenderJob
2020-05-31T22:09:27mw1302INFO[Job: TranslateRenderJob][Request ID: 3e0e5da2-b696-4e13-940b-cd5ea86ca2bc][Title: Template:OPW/en] Finished page edit operation
2020-05-31T22:09:31mw1310INFO[MessageIndex] Finished critical section in 4.7386178970337
NOTE: I used wiki:mediawikiwiki AND (channel:Translate OR channel:Translate.Jobs) as the Logstash query

I also checked other logs from wiki:mediawikiwiki between 2020-05-31 22:09:23.939 and 2020-05-31 22:09:32.620 and no relevant warnings or errors.

Conclusion: There seems to be still the case where TranslateRenderJob does not see the latest version.

Also, rETRA09d11f488fc0: Split message index rebuild off of TranslationsUpdateJob and enable… is a recent change that may have introduced the regression.

Some quick theoretical musings:

The above code change moved slow message index rebuild job out of the critical path. Now TanslateRenderJobs are processed almost immediately. Previously the slow message index rebuild would hide any replication lag induced issues.

Both loading the message definitions and translations are guarded by:

	public static function getSafeReadDB() {
		$lb = MediaWikiServices::getInstance()->getDBLoadBalancer();
		$index = self::shouldReadFromMaster() ? DB_MASTER : DB_REPLICA;

		return $lb->getConnection( $index );
	}

Which uses:

	public static function shouldReadFromMaster() {
		$lb = MediaWikiServices::getInstance()->getDBLoadBalancer();
		// Parsing APIs need POST for payloads but are read-only, so avoid spamming
		// the master then. No good way to check this at the moment...
		if ( PageTranslationHooks::$renderingContext ) {
			return false;
		}

		return PHP_SAPI === 'cli' ||
			RequestContext::getMain()->getRequest()->wasPosted() ||
			$lb->hasOrMadeRecentMasterChanges();
	}

Looking at the code for Special:RunJobs, it requires the use of POST, so this should always return DB_MASTER in the context of TranslateRenderJob at Wikimedia wikis.

Then I realize, there is still third item that is loaded separately: the template (text outside <translate> tags), and this is exactly the case with this example. Inspecting the code, I see TranslatablePage::getText does not use getSafeReadDB.

Also, the job itself could use LoadBalancer's getMasterPos and waitFor to make all this redundant. But I don't see any extensions using it so it seems a bit risky.

Change 607268 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] TranslatablePage: Apply replica/master logic to also getText

https://gerrit.wikimedia.org/r/607268

Change 607268 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] TranslatablePage: Apply replica/master logic to also getText

https://gerrit.wikimedia.org/r/607268

I would like to get more feedback to help investigate the task. So far I see a couple of reports where a template update did not show up in a page where the template is tranclused. This is something that is likely out of our control, as MediaWiki core handles dependency tracking and refreshing of pages due to dependency changes.

The example of Nemo_bis is different, however, because even though it is a template page, it clearly shows that a change to the template was not updated to the template itself until a dummy edit happened. This is clearly an issue with our code.

So please, do report more examples of this issue, and do provide the following info:

  • Is it related to updating translations, or when marking a page for translation
  • Link to the original edit
  • Link that shows when the page was marked for translation (if applicable)
  • Link to the page where you did not see the edit, and timestamp when you checked¹

¹ We rely on the job queue for many operations, so for example updating of translation pages is not instant. Do wait at least a few minutes to see if it is just a normal delay.

Well looking at template page again it's same issue as Nemo_bis highlighted. See following for example:
Main page which have changes with edit summary dummy edit https://meta.wikimedia.org/w/index.php?title=Template:Steward_requests/Username_changes&action=history
Translation page which do not have latest changes https://meta.wikimedia.org/w/index.php?title=Template:Steward_requests/Username_changes/es&action=history

Change 615408 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Wait for replication in jobs that require up to date data

https://gerrit.wikimedia.org/r/615408

Change 615408 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Wait for replication in jobs that require up to date data

https://gerrit.wikimedia.org/r/615408

Meta should get latest proposed fix tonight with 1.36.0-wmf.2. After the deployment, I would appreciate feedback whether it changes anything.

Appears to still be happening, at least some of the time.

Example on MediaWiki.org:

  1. 19:12, 29 July 2020, an IP adds a screenshot to Skin:Timeless
  2. 13:42, 30 July 2020, Shirayuki re-marks the page for translation.
  3. FuzzyBot makes no edits to the translation pages, even though the source text was changed.

Still happening on metawiki:

*Latest change on main page is 5 August 2020

*Latest change on translation page is also 5 August but is of previous edit on main page and does not contain 5 August 2020 main page edit.

Meta should get latest proposed fix tonight with 1.36.0-wmf.2. After the deployment, I would appreciate feedback whether it changes anything.

"Failed to load translation aids: Title does not correspond to a translatable message"

Required a dummy edit.

Required a dummy edit.

Actually, dummy edits are no longer needed, even when this bug happens. Since T256868 (6959a02 to be exact), pages can be marked for translation twice (or however many times you want) without dummy edits in between. Just go to Special:PageTranslation, find the page and press mark for translationMark this version for translation. In my experience, this always works around the bug just like if there was a dummy edit.

Required a dummy edit.

Actually, dummy edits are no longer needed, even when this bug happens. Since T256868 (6959a02 to be exact), pages can be marked for translation twice (or however many times you want) without dummy edits in between. Just go to Special:PageTranslation, find the page and press mark for translationMark this version for translation. In my experience, this always works around the bug just like if there was a dummy edit.

I wonder if a dummy edit from me would have also worked, but re-marking for translation is completely out of the question for me.

Required a dummy edit.

Actually, dummy edits are no longer needed, even when this bug happens. Since T256868 (6959a02 to be exact), pages can be marked for translation twice (or however many times you want) without dummy edits in between. Just go to Special:PageTranslation, find the page and press mark for translationMark this version for translation. In my experience, this always works around the bug just like if there was a dummy edit.

I wonder if a dummy edit from me would have also worked, but re-marking for translation is completely out of the question for me.

Sorry, I haven’t checked the diff, it was just a general response for all translation administrators’ information (@Quiddity in this specific case). A dummy edit on its own doesn’t have any effect, it just lets the mark for translation link appear on the page. The only thing you can achieve with a dummy edit as a non-TA is poking translation admins (which may be useful depending on the situation, but doesn’t magically resolve the issue).

Ah, thank you @Tacsipacsi for that tip.
@AlexisJazz I believe the underlying-problem that I was trying to fix is related to a different bug, T239921: Translate extension failing to load translation aids (per the same error message, "Title does not correspond to a translatable message"), not this one?

Looking at Logstash, for logs related to replication timeout, I see that it has happened 70 times since we deployed the fix. Most of these occurrences have been in the TranslateRenderJob that is responsible to update the translation pages with the content of the source page.

It also appears that these TranslateRenderJobs were triggered when a translation was made, and not when the page was marked for translation. This makes sense because when a page is marked for translation, we first wait for replication lag in TranslateUpdateJob, and then again in the TranslateRenderJob, so chances of it occurring here are much lower.

Still happening on metawiki:

*Latest change on main page is 5 August 2020

*Latest change on translation page is also 5 August but is of previous edit on main page and does not contain 5 August 2020 main page edit.

Investigating this,

  • Request Id: 48b9136c-3db6-47d6-956c-22f69b2ecc32,
  • wiki: metawiki,
  • mwversion: 1.36.0-wmf.2

Few things that stand out a bit,

  1. Sub-optimal transaction on DB(s) [xxx (metawiki) (TRX#8afcb0)]:
0	0.000694	query-m: DELETE FROM `translate_sections` WHERE trs_page = N [TRX#8afcb0]
1	0.001028	query-m: INSERT INTO `translate_sections` (trs_page,trs_key,trs_text,trs_order) VALUES (N,'X',N) [TRX#8afcb0]
2	0.000480	query-m: REPLACE INTO `translate_metadata` (tmd_group,tmd_key,tmd_value) VALUES ('X',N) [TRX#8afcb0]
3	0.000566	query-m: DELETE FROM `revtag` WHERE rt_page = N AND rt_type = 'X' AND rt_revision = N [TRX#8afcb0]
4	0.000516	query-m: INSERT INTO `revtag` (rt_page,rt_type,rt_revision) VALUES (N,'X',N) [TRX#8afcb0]
5	0.553251	...delay...
6	0.000460	query-m: DELETE FROM `translate_metadata` WHERE tmd_group = 'X' [TRX#8afcb0]
7	0.000391	query-m: DELETE FROM `translate_metadata` WHERE tmd_group = 'X' [TRX#8afcb0]
8	0.000338	query-m: DELETE FROM `translate_metadata` WHERE tmd_group = 'X' [TRX#8afcb0]
9	0.000519	query-m: INSERT INTO `logging` (log_type,log_action,log_timestamp,log_namespace,log_title,log_page,log_params,log_comment_id,log_actor) VALUES ('X',N) [TRX#8afcb0]
10	0.292869	query-m: SELECT page_id,page_title,page_namespace,page_latest,MAX(rt_revision) AS rt_revision,rt_type FROM `page`,`revtag` WHERE (page_id=rt_page) AND rt_type IN ('X') GROUP BY page_id, rt_type ORDER BY page_namespace, page_title  [TRX#8afcb0]
11	3.212778	...delay...
  1. Timeouts: Wikimedia\Rdbms\LoadBalancer::waitForMasterPos: timed out waiting on {dbserver} pos {pos} [{seconds}s] - Possibly unrelated?
#0 /srv/mediawiki/php-1.36.0-wmf.2/includes/jobqueue/jobs/CategoryMembershipChangeJob.php(109): Wikimedia\Rdbms\LoadBalancer->waitForMasterPos(Object(Wikimedia\Rdbms\DBConnRef))
#1 /srv/mediawiki/php-1.36.0-wmf.2/extensions/EventBus/includes/JobExecutor.php(79): CategoryMembershipChangeJob->run()
#2 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(Array)
#3 {main}

Other than that, I can't see anything else out of the ordinary.

Leaving in Recheck after deployment to gather some more feedback about more occurrences of this issue.

The last SELECT (and the time it takes to process the results) are holding the transaction open longer than necessary. https://www.mediawiki.org/wiki/Database_transaction has a bunch of useful information.

I think easiest option is to not show the list of all pages after SpecialPageTranslation::markForTranslation is called. Just show a success/failure and button/link to go to back to the listing.

The 0.5s delay in the middle is caused by a different thing and is harder to avoid, likely requiring large refactorings.

I have an issue that seems to be related to this ticket.

I just tried to mark a page for translation. Marking the page apparently works since I have access to the translation interface, but the translation memory is not working, and translations I've already made are not visible. I tried to mark the page for translations again, but it hasn't solved anything.

How can it be solved?

How can it be solved?

Johan made a null edit to the first item in my existing translation and now it works.

Change 621034 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Do not list all pages on marking a page for translation

https://gerrit.wikimedia.org/r/621034

Change 621034 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Do not list all pages on marking a page for translation

https://gerrit.wikimedia.org/r/621034

Update: Still seeing occurrences of the log message: Continuing despite replication lag in TranslateRenderJobs. This statement is still applicable:

It also appears that these TranslateRenderJobs were triggered when a translation was made, and not when the page was marked for translation. This makes sense because when a page is marked for translation, we first wait for replication lag in TranslateUpdateJob, and then again in the TranslateRenderJob, so chances of it occurring here are much lower.

46 occurrences since last deployment - https://logstash-next.wikimedia.org/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'2020-09-08T00:00:00.000Z',to:now))&_a=(columns:!(_source),filters:!(),index:'2d891220-161a-11ea-a364-c747e6d6cfc2',interval:auto,query:(language:kuery,query:'(channel:%20%22Translate.Jobs%22%20AND%20message:%22Continuing%20despite%20replication%20lag%22)'),sort:!())

Looking at req Id: c7449317-810e-4a50-85f1-4f94440fbd66

I see the following error: Timed out waiting for replication to reach 180355192-180355192-99800028

Stack trace:

#0 /srv/mediawiki/php-1.36.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php(874): Wikimedia\Rdbms\DatabaseMysqlBase->masterPosWait(Wikimedia\Rdbms\MySQLMasterPos, integer)
#1 /srv/mediawiki/php-1.36.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php(713): Wikimedia\Rdbms\LoadBalancer->doWait(integer, boolean, integer)
#2 /srv/mediawiki/php-1.36.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php(476): Wikimedia\Rdbms\LoadBalancer->waitForAll(Wikimedia\Rdbms\MySQLMasterPos, integer)
#3 /srv/mediawiki/php-1.36.0-wmf.8/extensions/Translate/tag/TranslateRenderJob.php(48): Wikimedia\Rdbms\LBFactory->waitForReplication()
#4 /srv/mediawiki/php-1.36.0-wmf.8/extensions/Translate/tag/PageTranslationHooks.php(253): TranslateRenderJob->run()
#5 /srv/mediawiki/php-1.36.0-wmf.8/extensions/Translate/tag/PageTranslationHooks.php(231): PageTranslationHooks::updateTranslationPage(TranslatablePage, string, User, integer, string)
#6 /srv/mediawiki/php-1.36.0-wmf.8/includes/deferred/MWCallableUpdate.php(38): PageTranslationHooks::{closure}()
#7 /srv/mediawiki/php-1.36.0-wmf.8/includes/deferred/DeferredUpdates.php(467): MWCallableUpdate->doUpdate()
#8 /srv/mediawiki/php-1.36.0-wmf.8/includes/deferred/DeferredUpdates.php(344): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#9 /srv/mediawiki/php-1.36.0-wmf.8/includes/deferred/DeferredUpdates.php(278): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#10 /srv/mediawiki/php-1.36.0-wmf.8/includes/deferred/DeferredUpdates.php(194): DeferredUpdates::handleUpdateQueue(array, string, integer)

I think this is a by-product of calling - waitForReplication.

Questions for Performance-Team: We have implemented waitForReplication in our job to avoid reading stale data. However, log shows that it timeouts a few times per day across production wikis. I checked a couple of them, and the jobs actually got correct data despite this. Is it safe to ignore this message? Is there anything further we could/should do to avoid replication lag issues?

The default timeout for waitForReplication() for web requests is 1 second. JobRunner passes 3 seconds to its calls and sets setDefaultReplicationWaitTimeout( 3 ). Maybe TranslateRenderJob should use a different timeout?

Are these running in deferred updates of jobs themselves run inside a JobRunner script or are they just running at the end of regular web requests (e.g. $wgJobRunRate)?

Abijeet recently created docs about our jobs: https://www.mediawiki.org/wiki/Help:Extension:Translate/Process_flow_in_MediaWiki_jobs.

This is on Wikimedia production, where $wgJobRunRate is set to 0 to my knowledge.

Specifically, on PageSaveComplete hook handler <1> we create a deferred update <2> that immediately runs TranslateRenderJob <3>.

Now, after looking that up and explaining that, I think this explains why this code path triggers warnings and the other code path doesn't. To summarize:

  • No log warnings appear: Special:PageTranslation inserts TranslationsUpdateJob (wait up to 3s) which inserts TranslateRenderJob (waits up to 3s)
  • Log warnings appear: PageSaveComplete hook handler creates a deferred update which runs TranslateRenderJob (waits up to 1s)

It seems that we should try to queue a job instead of running it in a deferred update. These updates should happen fast, but not necessary "deferred update" fast. Moving these updates into a job queue gains us +2s additional wait time which should eliminate most log warnings. There are a few other callers of this code (hooks on page moves and page deletions), but on a quick look, same logic can be applied to them. The only downside I see is that for non-Wikimedia wikis the delay for running TranslateRenderJobs could get unacceptably long. However, the requirement for a working job is a documented on installation instructions and if user reports come up, we could switch to previous behavior based on either $wgJobRunRate or another configuration option.

Change 632489 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Defer TranslateRenderJobs for saves, moves and deletions

https://gerrit.wikimedia.org/r/632489

Change 632489 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Defer TranslateRenderJobs for saves, moves and deletions

https://gerrit.wikimedia.org/r/632489

Logs to be checked after deployment. If ok, let's not forget to run refresh-translatable-pages.php script.

Last log entry for "Continuing despite replication lag" was on 2020-10-14 (wmf.11). The latest fix seems to have done the trick.

Will start running refresh-translatable-pages.php scripts.

Script run for all non-closed except metawiki and mediawikiwiki.

Now also run for metawiki and mediawikiwiki.