
"Translate this page" leads to wrong page
Closed, Resolved | Public | 16 Estimated Story Points | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

What should have happened instead?:

  • Loaded the page it initially linked to

After doing a purge, the link started working properly in the initial window...
But if I open that link in a private-window, it still initially loads the wrong link.
And if it does send me to the correct page, the translation blocks often show the yellow error message: "Failed to load translation aids: Title does not correspond to a translatable message"

Event Timeline


The latter was renamed and marked for translation multiple times: https://meta.wikimedia.org/w/index.php?title=Special:Log&page=Leadership+Development+Working+Group%2FContent. There are lots of log entries, but it is hard to find a clue among them.

The former is simpler. It has plenty of "[MessageHandle] MessageIndex is out of date. Page {pagename} refers to unknown group {messagegroup}" errors. There are also plenty of "MessageIndexRebuildJob [Special:MyLanguage/Main Page]: MessageIndex: unable to acquire lock" errors. The latter is probably the cause of the former, but what causes the lock error is not clear. Needs more investigation.

And it's not clear how this relates to this issue. The message group cache is separate from the message index, and the jobs were running successfully as far as I can see. It's also not clear why action=purge would help, as it shouldn't trigger message group cache creation. I just wonder whether doing that adds enough of a delay that the observed issue goes away on its own.

I just faced this again on Wikidata. This shows Invalid value for parameter mcgroup.

We're encountering this again, with Tech News (https://meta.wikimedia.org/wiki/Tech/News/2022/45). Notes:

  • Usually, clicking Translate this page will send us to the (previously described) "Recent Additions" group.
  • Sometimes, it starts to load the correct page, but shows the (previously described) error Invalid value for parameter mcgroup
  • Sometimes, especially after a purge (?), it will successfully show the interface, but if we click a text-chunk then in the Suggestions sidebar it shows Failed to load translation aids: Title does not correspond to a translatable message
  • I also notice now (after many reloads and purges) that it is not showing the <languages/> bar on the page itself. But if I reload the tab, it reappears.
    • In case this is relevant, I opened the HTML and I see that the mw-pt-languages div isn't included. (I don't know if any of the green text is relevant, but it was right there, so I included it)

image.png (838×1 px, 277 KB)

  • Finally, after ~45 mins, it appears to be working smoothly again.
Nikerabbit triaged this task as Medium priority. Jun 5 2023, 11:39 AM

Has this been observed recently or has the problem gone away in the meantime?

Mmh… T334621 was probably a revival of this issue.

Someone else just experienced this at governance-wiki, so it's definitely still happening. (cf. staff-only slack link with details of the confusion. No extra diagnosis details though, hence I'm not copying anything over.)

Here's what happens when a page is marked for translation:

  1. The marked tag (tp:mark) is added to the page in the revtag table.
  2. Subsequently, the MessageGroups::singleton()->recache(); method is called to update the WANObjectCache that stores the list of all known message groups.
  3. The process to update the WANObjectCache is as follows:
    1. We fetch loaders for all the different types of message groups that we support: AggregateMessageGroupLoader, FileBasedMessageGroupLoader, etc. In the case of translatable pages, it's the TranslatablePageMessageGroupStore we are interested in. This loads all the translatable pages from the database.
    2. The database instance passed to the message group loaders is determined via the Utilities::getSafeReadDB() method, which checks various conditions, including whether the load balancer object had any recent or still pending writes issued against it by this PHP thread, and then determines the database instance to use.
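The marking flow above can be modelled with a small toy, sketched here in Python rather than the extension's PHP. Everything in it (the `Wiki` class, `safe_read_db`, the single `had_recent_writes` flag) is a hypothetical stand-in, and only the "recent or pending writes" condition of Utilities::getSafeReadDB() is modelled. The `page-` group id prefix follows the log messages quoted later in this task.

```python
from dataclasses import dataclass, field


@dataclass
class Wiki:
    """Toy stand-in for the pieces named in the steps above (all hypothetical)."""
    revtag: set = field(default_factory=set)         # (page, tag) rows
    group_cache: dict = field(default_factory=dict)  # stands in for WANObjectCache
    had_recent_writes: bool = False                  # load balancer state

    def safe_read_db(self) -> str:
        # Simplification: only the "recent or pending writes" condition is modelled.
        return "primary" if self.had_recent_writes else "replica"

    def load_translatable_page_groups(self, db: str) -> list:
        # Both databases share one table in this toy; replication lag is ignored.
        return sorted(f"page-{page}" for page, tag in self.revtag if tag == "tp:mark")

    def recache(self) -> str:
        db = self.safe_read_db()
        self.group_cache["groups"] = self.load_translatable_page_groups(db)
        return db

    def mark_for_translation(self, page: str) -> str:
        self.revtag.add((page, "tp:mark"))  # step 1: add the marked tag
        self.had_recent_writes = True
        return self.recache()               # steps 2 and 3: rebuild the group list


wiki = Wiki()
db_used = wiki.mark_for_translation("Tech/News/2022/45")
print(db_used, wiki.group_cache["groups"])
```

In this model the write in step 1 is what steers the later read to the primary, so the freshly marked page ends up in the cached group list.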

Why we are seeing this issue:

  1. When user lands on Special:Translate, the code tries to fetch the requested group by calling MessageGroups::getGroup.
  2. If the group is not found, it falls back to loading the default Recent additions group.
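The fallback described in these two steps can be sketched as follows. This is an illustrative Python model, not the extension's code: `KNOWN_GROUPS`, `resolve_requested_group` and the `recent-additions` id are hypothetical placeholders.

```python
from typing import Optional

# Hypothetical contents of the cached group list.
KNOWN_GROUPS = {"page-Tech/News/2022/45": "Tech/News/2022/45"}
# Placeholder id for the default group; not the extension's real identifier.
DEFAULT_GROUP_ID = "recent-additions"


def get_group(group_id: str) -> Optional[str]:
    """Stand-in for MessageGroups::getGroup: None when the group is unknown."""
    return KNOWN_GROUPS.get(group_id)


def resolve_requested_group(group_id: str) -> str:
    """Return the requested group id, or the default when the lookup fails."""
    if get_group(group_id) is None:
        return DEFAULT_GROUP_ID
    return group_id


print(resolve_requested_group("page-Tech/News/2022/45"))
print(resolve_requested_group("page-User:Seddon/test"))
```

The second call models the reported symptom: a group that is missing from the cached list silently resolves to the default group instead of producing an error.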

After step 3.1 of the marking process above (3.A), MessageGroups::getGroup should not be returning null. Possible reasons why it might still happen:

  1. Utilities::getSafeReadDB() does not return the proper database instance. If a replica is used, then stale data might be getting written into WANObjectCache.
  2. WANObjectCache::getWithSetCallback returns a stale value, maybe due to the lockTSE parameter. Needs some more investigation.
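The first possible cause can be illustrated with a toy replica that lags behind the primary. This is a sketch under stated assumptions: `ReplicatedTable` and `recache` are hypothetical names, and replication is reduced to an explicit `replicate()` call.

```python
class ReplicatedTable:
    """Primary plus one replica that only catches up when replicate() is called."""

    def __init__(self):
        self.primary = []
        self.replica = []

    def write(self, row):
        self.primary.append(row)

    def replicate(self):
        self.replica = list(self.primary)


def recache(rows, cache):
    """Store whatever the chosen database returned, fresh or not."""
    cache["groups"] = list(rows)


table = ReplicatedTable()
cache = {}

table.write("page-Tech/News/2022/45")  # page marked; replica has not caught up
recache(table.replica, cache)          # cause 1: the replica is chosen for the read
stale_groups = list(cache["groups"])

recache(table.primary, cache)          # the same recache against the primary
fresh_groups = list(cache["groups"])

print(stale_groups, fresh_groups)
```

If the read goes to the lagging replica, the cache is rebuilt without the new group, and it stays that way until something triggers another rebuild.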

One thing that catches my attention is that in various places we first call touchCheckKey and then regenerate the cache right after. This can potentially lead to a storm of different threads trying to rebuild the cache from stale data (though WANObjectCache may have logic to avoid storms).

To me this looks wrong. We should first regenerate the cache, and only then call touchCheckKey (if that is even needed?). Would be nice to get an expert opinion/review.
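To make the ordering question concrete, here is a toy model of a check key in Python. It assumes, as a simplification, that a cached value counts as stale when it was written before the check key was last touched; `CheckKeyCache` and its methods are hypothetical, and WANObjectCache's storm-avoidance logic (such as lockTSE) is not modelled at all.

```python
import itertools


class CheckKeyCache:
    """Single-key cache with a check key, driven by a logical clock."""

    def __init__(self):
        self._clock = itertools.count(1)
        self._value = None    # (payload, written_at) or None
        self._check_time = 0  # when the check key was last touched
        self.regenerations = 0

    def touch_check_key(self):
        self._check_time = next(self._clock)

    def set(self, payload):
        self._value = (payload, next(self._clock))

    def get_with_callback(self, callback):
        # A value is fresh only if it was written after the last touch.
        if self._value is not None and self._value[1] > self._check_time:
            return self._value[0]
        self.regenerations += 1
        self.set(callback())
        return self._value[0]


def touch_then_set():
    cache = CheckKeyCache()
    cache.set("old")
    cache.touch_check_key()
    # A reader arriving in this window regenerates from whatever it can read.
    seen_in_window = cache.get_with_callback(lambda: "stale-read")
    cache.set("new")
    return seen_in_window, cache.get_with_callback(lambda: "unused"), cache.regenerations


def set_then_touch():
    cache = CheckKeyCache()
    cache.set("old")
    cache.set("new")
    cache.touch_check_key()
    # The touch happened after the write, so the new value is treated as stale.
    return cache.get_with_callback(lambda: "regenerated"), cache.regenerations


print(touch_then_set())
print(set_then_touch())
```

In this model, touching first opens a window in which readers regenerate on their own, while touching last invalidates the value that was just written. Whether either effect matters in production depends on the parts of WANObjectCache that are left out here.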

These days, I notice a delay of several minutes (maybe hours) before a page becomes translatable.

Change 990599 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageIndex: improve logging

https://gerrit.wikimedia.org/r/990599

Change 990599 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageIndex: improve logging

https://gerrit.wikimedia.org/r/990599

A user on the English Wikimedia Discord reported an issue where clicking on the "translate this page" link is failing to redirect the user to the message group in the interface.

Replicated it here: https://meta.wikimedia.org/wiki/User:Seddon/test
First encountered https://meta.wikimedia.org/wiki/Ukraine's_Cultural_Diplomacy_Month_2024

User reported briefly seeing "Invalid value for parameter mcgroup."

I also experienced a long delay when migrating https://commons.wikimedia.org/wiki/Template:PD-Hungary to the Translate extension last Sunday: I marked the page for translation at 14:12 UTC, and Special:Translate became functional around 14:30 UTC. Until then, directly editing the Translations-namespace pages worked (but only because I knew which pages I had to edit, a newbie won’t guess them), although they didn’t show the English text above the edit form.

Change 1007309 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageIndexRebuildJob: avoid title, add caller

https://gerrit.wikimedia.org/r/1007309

Change 1007310 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageHandle: avoid flooding the job queue with rebuild jobs

https://gerrit.wikimedia.org/r/1007310

Change 1007311 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageIndex: improve logging

https://gerrit.wikimedia.org/r/1007311

Change 1007312 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] DatabaseMessageIndex: reduce lock wait time

https://gerrit.wikimedia.org/r/1007312

Change 1007316 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageIndex: improve rebuilding

https://gerrit.wikimedia.org/r/1007316

Thanks to the logs and graphs I was able to identify multiple points of improvement. The root cause seems to be that we are flooding the job queue with MessageIndexRebuildJobs and the deduplication is not working as well as one would think, as multiple job executors pick up jobs concurrently.

With this in mind, I'm fixing the one place that creates most of the jobs and making other smaller improvements.
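One way such a deduplication gap can arise is sketched below in Python. The model assumes that a new job is only compared against jobs still waiting in the queue; that is an assumption made for illustration, and the real job queue may deduplicate differently. `JobQueue` and both helper functions are hypothetical.

```python
from collections import deque


class JobQueue:
    """Toy queue that only deduplicates against jobs still waiting."""

    def __init__(self):
        self.waiting = deque()
        self.accepted = 0

    def push(self, job):
        if job in self.waiting:
            return False  # duplicate of a waiting job: dropped
        self.waiting.append(job)
        self.accepted += 1
        return True

    def pop(self):
        return self.waiting.popleft()


def burst_without_executors(n):
    """All jobs arrive before any executor runs: duplicates collapse to one."""
    queue = JobQueue()
    for _ in range(n):
        queue.push("MessageIndexRebuildJob")
    return queue.accepted


def burst_with_eager_executors(n):
    """An idle executor takes each job at once, so nothing is left to match."""
    queue = JobQueue()
    running = []
    for _ in range(n):
        queue.push("MessageIndexRebuildJob")
        running.append(queue.pop())
    return queue.accepted, len(running)


print(burst_without_executors(5))
print(burst_with_eager_executors(5))
```

With eager executors every job is accepted and all of them run at the same time, which would fit the "unable to acquire lock" errors mentioned earlier if only one rebuild can hold the lock.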

Change 1007309 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageIndexRebuildJob: avoid title, add caller

https://gerrit.wikimedia.org/r/1007309

Change 1007310 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageHandle: avoid flooding the job queue with rebuild jobs

https://gerrit.wikimedia.org/r/1007310

Change 1007311 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageIndex: improve logging

https://gerrit.wikimedia.org/r/1007311

Change 1007312 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] DatabaseMessageIndex: reduce lock wait time from 30 to 5 seconds

https://gerrit.wikimedia.org/r/1007312

Change 1007316 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageIndex: improve rebuilding

https://gerrit.wikimedia.org/r/1007316

Nikerabbit set the point value for this task to 4. Mar 4 2024, 1:01 PM

For future reference, debug log entries are visible on mwlog1002.eqiad.wmnet, but not in Logstash. A useful dashboard to check is https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=MessageIndexRebuildJob

All my changes so far should address the issue of "Failed to load translation aids".

The first case of Special:Translate not recognizing the group should be impossible, given the act of marking the page for translation updates message groups during the POST request.

Change 1009488 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] TranslateSpecialPage: Add debug logging for T320220

https://gerrit.wikimedia.org/r/1009488

Change 1009488 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] TranslateSpecialPage: Add debug logging for T320220

https://gerrit.wikimedia.org/r/1009488

Change #1016351 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] Remove touchKey from MessageGroupWANCache

https://gerrit.wikimedia.org/r/1016351

Change #1016351 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Remove touchKey from MessageGroupWANCache

https://gerrit.wikimedia.org/r/1016351

Change #1016718 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] Special:Translate: reduce false positives in debug logging

https://gerrit.wikimedia.org/r/1016718

I haven’t encountered this issue for several weeks (probably since MW 1.42.0-wmf.21), whereas it had failed consistently since November. Thank you for the work! 🙂

Change #1016718 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Special:Translate: reduce false positives in debug logging

https://gerrit.wikimedia.org/r/1016718

Considering resolved for now. In case this still appears in the future, we have some more logging now.

I'm getting this again in Tech News, as of ~10 minutes ago.

image.png (849×1 px, 122 KB)

But after 10 minutes of poking, it self-resolved.

Page marked for translation: 2024-04-19T03:21:22 Quiddity (WMF) talk contribs marked Tech/News/2024/17 for translation (UTC+3)

Unfortunately debug logs start at 2024-04-19 02:49:38.836336 (UTC+0) making them useless.

Logstash only shows level info or above. Relevant request id is ad31c7f1-a9f2-4e43-ad4d-7fc906448e9b

There are "Inconsistent revision ID" warnings from ParserCache.

Update command finishes Apr 19, 2024 @ 00:21:23.717 after starting at Apr 19, 2024 @ 00:21:22.466.
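The timestamps above are easier to compare once they are all in UTC. The following Python snippet does the arithmetic; it assumes the Logstash timestamps for the update command are in UTC, which is not stated explicitly in this comment.

```python
from datetime import datetime, timedelta, timezone

# Log entry time, given in UTC+3.
marked_local = datetime(2024, 4, 19, 3, 21, 22, tzinfo=timezone(timedelta(hours=3)))
marked_utc = marked_local.astimezone(timezone.utc)

# First available debug log entry, given in UTC+0.
debug_log_start = datetime(2024, 4, 19, 2, 49, 38, 836336, tzinfo=timezone.utc)

# Update command start and finish (assumed to be UTC).
update_start = datetime(2024, 4, 19, 0, 21, 22, 466000, tzinfo=timezone.utc)
update_end = datetime(2024, 4, 19, 0, 21, 23, 717000, tzinfo=timezone.utc)

gap = debug_log_start - marked_utc
update_duration = update_end - update_start

print(marked_utc.isoformat())           # 2024-04-19T00:21:22+00:00
print(gap)                              # 2:28:16.836336
print(update_duration.total_seconds())  # 1.251
```

So the page was marked at 00:21:22 UTC, roughly two and a half hours before the debug logs begin, and the update command itself took about 1.25 seconds.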

There are also multiple log entries from the ttmserver like rebuild command completed on eqiad for Translations:wikiLearn/Courses/course-v1:Wikimedia-Foundation+wmf commdev partnerships cg+2022/en/block-v1:Wikimedia-Foundation+wmf commdev partnerships cg+2022+type@problem+block@a10501fa16a6447d9011be599417e2a4/problem.multiplechoiceresponse.choicegroup.choice. for the same request id. Why weren't these processed earlier? According to the page history the page hasn't had edits for over a year: https://meta.wikimedia.org/w/index.php?title=WikiLearn/Courses/course-v1:Wikimedia-Foundation%2Bwmf_commdev_partnerships_cg%2B2022/en/block-v1:Wikimedia-Foundation%2Bwmf_commdev_partnerships_cg%2B2022%2Btype@problem%2Bblock@a10501fa16a6447d9011be599417e2a4&action=history

Also noticed that we apparently process deletes for ttmserver for things that are not in translation namespaces, like files and wikidata IDs.

Got to read the logs better now with another example. I see everything going fine:

  • Interim cache is added
  • On next rebuild interim cache is removed

But right after that I'm seeing messages like MessageIndex is out of date. Page Translations:International Museum Day 2024/5/en refers to unknown group page-International Museum Day 2024. This spawns new MessageIndexRebuildJobs (as it should, but no longer overwhelming everything thanks to my earlier fixes) that run but the error doesn't go away. I can only see this happening if the MessageGroupLoader cache is stale. Additionally when using mwdebug, I cannot reproduce the problem on eqiad debug servers, but I can reproduce it on codfw debug servers. All the log warnings are also from codfw.

This makes me suspect there is some issue with cross-db replication and that I will probably need to consult people familiar with this topic.

I ran MessageGroups::singleton()->clearCache(); via shell.php and loaded Special:Translate from codfw and I think that fixed the problem.
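The pattern described above, where rebuild jobs succeed but the warning persists until the group cache is cleared, can be modelled like this. It is a Python sketch with hypothetical helpers (`rebuild_index`, `check_handle`); the page and group names are taken from the log message quoted above.

```python
def rebuild_index(groups):
    """Map every message page to the id of the group that owns it."""
    return {page: group_id for group_id, pages in groups.items() for page in pages}


def check_handle(page, index, known_group_ids):
    """Return a warning when the index points at a group the cache does not know."""
    group_id = index.get(page)
    if group_id is not None and group_id not in known_group_ids:
        return f"MessageIndex is out of date. Page {page} refers to unknown group {group_id}"
    return None


page = "Translations:International Museum Day 2024/5/en"
database_groups = {"page-International Museum Day 2024": [page]}
stale_group_cache = set()  # group list cached before the page was marked

warnings = []
for _ in range(3):
    index = rebuild_index(database_groups)  # each rebuild job succeeds
    warnings.append(check_handle(page, index, stale_group_cache))

# Model of clearing the cache: the next load reads the current group list.
fresh_group_cache = set(database_groups)
after_clear = check_handle(page, rebuild_index(database_groups), fresh_group_cache)

print(len([w for w in warnings if w]), after_clear)
```

Rebuilding the index never helps in this model because the index was already correct; the stale part is the cached group list that the lookup is checked against.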

I've read https://www.mediawiki.org/wiki/Object_cache and the code documentation for WANObjectCache and I think the following changes should be made:

  • Interim Cache should be changed to use https://www.mediawiki.org/wiki/Object_cache#Main_stash given it's the only one that is replicated across datacenters.
  • The removal of touchKey was incorrect and should be restored. WANObjectCache only replicates deletes and purges across datacenters.
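The reasoning behind both points can be illustrated with a two-datacenter toy in Python. It encodes only what is stated above: cache writes stay local, purges reach every datacenter, and the main stash is replicated. The `Cluster` class and its methods are hypothetical.

```python
class Cluster:
    """Per-datacenter caches plus one replicated stash (toy model)."""

    def __init__(self, datacenters):
        self.local = {dc: {} for dc in datacenters}  # one cache per datacenter
        self.stash = {}                              # shared by all datacenters

    def cache_set(self, dc, key, value):
        self.local[dc][key] = value                  # stays in this datacenter

    def cache_purge(self, key):
        for cache in self.local.values():            # reaches every datacenter
            cache.pop(key, None)

    def cache_get(self, dc, key, regenerate):
        cache = self.local[dc]
        if key not in cache:
            cache[key] = regenerate()
        return cache[key]


database = ["page-A"]


def load_groups():
    return list(database)


cluster = Cluster(["eqiad", "codfw"])
cluster.cache_get("eqiad", "groups", load_groups)
cluster.cache_get("codfw", "groups", load_groups)

database.append("page-B")                            # a new page is marked
cluster.cache_set("eqiad", "groups", load_groups())  # recache in eqiad only
codfw_after_set = list(cluster.cache_get("codfw", "groups", load_groups))

cluster.cache_purge("groups")                        # purge is seen everywhere
codfw_after_purge = list(cluster.cache_get("codfw", "groups", load_groups))

cluster.stash["interim"] = ["page-B"]                # written from eqiad
codfw_interim = cluster.stash.get("interim")         # readable from codfw

print(codfw_after_set, codfw_after_purge, codfw_interim)
```

Writing the new value only updates the datacenter that handled the request, while a purge forces the other datacenter to regenerate on its next read.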

Change #1024641 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageIndex: Use MainObjectStash for interim cache

https://gerrit.wikimedia.org/r/1024641

Change #1024641 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageIndex: Use MainObjectStash for interim cache

https://gerrit.wikimedia.org/r/1024641

Nikerabbit changed the point value for this task from 4 to 16. Tue, May 14, 1:22 PM

Change #1031866 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] Simplify message group caching

https://gerrit.wikimedia.org/r/1031866

Change #1032149 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] Simplify message group caching

https://gerrit.wikimedia.org/r/1032149

Change #1032149 abandoned by Nikerabbit:

[mediawiki/extensions/Translate@master] Simplify message group caching

Reason:

mistake

https://gerrit.wikimedia.org/r/1032149

Change #1031866 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Simplify message group caching

https://gerrit.wikimedia.org/r/1031866

Being once more optimistic that this issue is now solved for good. Fingers crossed.