Page MenuHomePhabricator

Some interface messages (e.g. sitenotice, others) are loading old revisions of their messages
Closed, ResolvedPublic

Description

In simplewiki, the Mediawiki:Mycontris is set to "My changes" and has been changed more than nine years ago but at the top of the page, it shows "My edits":


It's not falling back to English but it's using a really old value instead so I'm confused what's going on here.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
daniel added a subscriber: BPirkle.EditedMar 22 2019, 11:17 AM

This patch touching MessageCache landed on monday:

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/496816/5/includes/cache/MessageCache.php

If that patch somehow caused some/all pages in the MediaWiki namespace to be ignored and the standard message to be used, that would explain the problem. It would also point to a disturbing lack of unit tests.

A first attempt to reproduce locally failed. Overriding a message in the MediaWiki namespace worked as expected. Stepping through the new code, I see nothing suspicious at a first glance.

I note that I see "My changes" at the top of the page on simplewiki, as expected. Has this been fixed by some kind of purge since this ticket was filed?

T218983 suggests that this problem exists for *all* locally overwritten interface messages. Can anyone confirm this?

That ticket also states that a null edit fixes the issue for a moment, but then it's reverted to the old bad state.

I'm gonna go ahead and say it's an off-by-one error. For
https://sv.wikipedia.org/wiki/MediaWiki:Excontent
https://sv.wikipedia.org/wiki/MediaWiki:Excontentauthor
https://sv.wikipedia.org/wiki/MediaWiki:Nstab-main
https://sv.wikipedia.org/wiki/MediaWiki:Ipboptions
https://sv.wikipedia.org/wiki/MediaWiki:Sitenotice
it seems to show the [ current - 1 ] revision.

I note that I see "My changes" at the top of the page on simplewiki, as expected. Has this been fixed by some kind of purge since this ticket was filed?

As I said in T218983, resaving a message does fix the issue, but only for a few seconds, or perhaps a minute.

@Nikerabbit tipped me off to the possibility of the query fetching too many revisions, rather than failing to find revisions. This is indeed the case: the rev_id = page_latest condition is lost in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/496816/5/includes/cache/MessageCache.php. Fix upcoming.

Change 498362 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Only load latest revision in MessageCache::loadFromDB

https://gerrit.wikimedia.org/r/498362

Change 498363 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@wmf/1.33.0-wmf.22] Only load latest revision in MessageCache::loadFromDB

https://gerrit.wikimedia.org/r/498363

I noticed this on itwiki, too, so I'll provide another example for completeness: this week I have changed visualeditor-ca-editsource from "Modifica wikitesto" to "Modifica sorgente". Viewing any page will result in a quick flash of the new message, which is then suddenly replaced by the old one. While trying to debug I went to this page, which uses {{in:visualeditor-ca-editsource}} and showed the old version. Purging the page updated the message to the new one.

daniel added a comment.EditedMar 22 2019, 1:02 PM

Scheduled SWAT deployment between 18:00 and 19:00 UTC.

No SWAT today, it's Friday. Silly me.

After the fix is live, affected messages can be fixed by purging the respective pages in the MediaWiki namespace.
We may be able to purge them in bulk, but not sure yet.

Scheduled SWAT deployment between 18:00 and 19:00 UTC.

There's no SWAT today (it's Friday and you put it in yesterday SWAT) but this task is UBN! giving it permission to deploy at any time.

There's no SWAT today (it's Friday and you put it in yesterday SWAT) but this task is UBN! giving it permission to deploy at any time.

Hahaha! ooops!

daniel added a comment.EditedMar 22 2019, 1:22 PM

Some thoughts on how to update caches:

Calling MessageCache::load() should work, perhaps from a one-off script. There are some caveats to this:

  • The method is protected. Would need to use reflection to call it, or make it accessible.
  • The method is per language. We'd probably want to call it for the content language at least. Other languages may be less important.
  • MessageCache::load() grabs a lock, other threads trying to do the same may pile up. These shouldn't be too frequent, though. Also, running load() shouldn't take very long (wild guess: ten seconds).

MessageCache::clear() could also work, but then we have no control over when and how the re-caching is triggered. For all I know, it may happen during a web request, possibly causing a timeout.

abian added a subscriber: abian.Mar 22 2019, 1:31 PM

MessageCache::clear would fit the bill, but I'm not sure it will fully be able to avoid cache refill storm due to web requests (it should be fine due to use of locks, but not my area of expertise). Bumping MSG_CACHE_VERSION would probably do the same.

$wgMsgCacheExpiry seems to be unaltered, default is one day, so we could just wait.

MessageCache::recache method would be ideal, but doesn't exist.

Trying to purge with manual edits is both difficult and doesn't update whole cache anyway, so that is out of question.

Perhaps one could try to write a script that temporarily makes one of the conditions on MessageCache::isCacheExpired to return false (most likely the return value of wfTimestampNow) so that it gets recached with (in)direct calls to load() but that's getting way too complicated imho.

I noticed this on itwiki, too, so I'll provide another example for completeness: this week I have changed visualeditor-ca-editsource from "Modifica wikitesto" to "Modifica sorgente". Viewing any page will result in a quick flash of the new message, which is then suddenly replaced by the old one. While trying to debug I went to this page, which uses {{in:visualeditor-ca-editsource}} and showed the old version. Purging the page updated the message to the new one.

But this was on TranslateWiki, not on it.wikipedia.org directly, so if I understand @daniel’s fix correctly that shouldn’t even have been affected? itwiki:MediaWiki:visualeditor-ca-editsource doesn’t exist as a page.

I noticed this on itwiki, too, so I'll provide another example for completeness: this week I have changed visualeditor-ca-editsource from "Modifica wikitesto" to "Modifica sorgente". Viewing any page will result in a quick flash of the new message, which is then suddenly replaced by the old one. While trying to debug I went to this page, which uses {{in:visualeditor-ca-editsource}} and showed the old version. Purging the page updated the message to the new one.

But this was on TranslateWiki, not on it.wikipedia.org directly, so if I understand @daniel’s fix correctly that shouldn’t even have been affected? itwiki:MediaWiki:visualeditor-ca-editsource doesn’t exist as a page.

That's correct. I have to say that I didn't check the patch before writing my comment. It could be a different issue, although that would possibly be a peculiar coincidence.

I noticed this on itwiki, too, so I'll provide another example for completeness: this week I have changed visualeditor-ca-editsource from "Modifica wikitesto" to "Modifica sorgente". Viewing any page will result in a quick flash of the new message, which is then suddenly replaced by the old one. While trying to debug I went to this page, which uses {{in:visualeditor-ca-editsource}} and showed the old version. Purging the page updated the message to the new one.

But this was on TranslateWiki, not on it.wikipedia.org directly, so if I understand @daniel’s fix correctly that shouldn’t even have been affected? itwiki:MediaWiki:visualeditor-ca-editsource doesn’t exist as a page.

That's correct. I have to say that I didn't check the patch before writing my comment. It could be a different issue, although that would possibly be a peculiar coincidence.

Translatewiki.net isn't even running the code that caused this issue yet, so it can't be the same thing. Also, message transclusion is a rather different beast. Seeing a "flash" of another version indicates the usage of some JavaScript trickery to apply localization in the user language. This all sounds like it's a different issue.

@daniel Well, while it's true that some JS is altering the message, that JS isn't from a user or wiki script, as the issue still happens in safemode. Maybe VE is replacing the standard message with its own? But yes, it could be a different bug. I guess we'll see after the patch above is deployed.

MessageCache::recache method would be ideal, but doesn't exist.

It looks to me like MessageCache::loadFromDBWithLock should have more or less the desired effect – specifically, these lines:

$cache = $this->loadFromDB( $code, $mode );
$this->cache->set( $code, $cache );
$saveSuccess = $this->saveToCaches( $cache, 'all', $code );

It’s protected, but could be called from eval.php e. g. via TestingAccessWrapper.

Zppix added a subscriber: Zppix.Mar 22 2019, 2:34 PM
Teles added a subscriber: Teles.Mar 22 2019, 2:39 PM

I'm gonna go ahead and say it's an off-by-one error.

For the record: it's not an off-by-one error. The problem was that the query would load all revisions, instead of just the latest revision, and write them all to the cache, all using the same key. So the last revision in the query result would win. Since no order is specified on the query, which revision will come last is undefined in theory, but in practice, it seems to always be the one with the largest revision ID. This will usually be the latest revision, so the bug doesn't occur in most cases, and wasn't caught in testing.

To reproduce the bug, I had to import an old revision, to create a revision with a higher revision ID that was still not the correct current revision. The same effect of creating an old revision with a new/high revision ID could in the past also be triggered by deleting and undeleting a page, but we have been restoring the original revision ID for a while now.

I suspect that the affected messages are ones with undeleted or imported revisions. Or the order of the result set is simply different in production, due to index optimization and query planning on large databases.

It looks to me like MessageCache::loadFromDBWithLock should have more or less the desired effect – specifically, these lines:

You are right, loadFromDBWithLock() is probably the best choice. load() would not work, since it loads from the cache instead of refreshing it. And loadFromDB() itself only loads, it doesn't do anything to caches.

But then, if the bad messages have been coming back periodically (due to some cron job or job runner activity, presumably), I'd expect the good messages to be restored soon, in the same way.

FTR, even if mine is a different issue, it has the same solution: purging the local page of the message, even if it doesn't exist, fixed it.

Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Worth putting in Tech/News as relatively high profile.

Just FYI, Wikiversity user:Billinghurst upped the Mediawiki site id by one on Wikiversity and very old site notice disappeared at least for now.

Jseddon added a comment.EditedMar 22 2019, 3:31 PM

Fixed SOPA site notice displaying on ja.wiki

Similar SOPA site notice had to be removed from Portuguese wikipedia yesterday evening.

Change 498362 merged by jenkins-bot:
[mediawiki/core@master] Only load latest revision in MessageCache::loadFromDB

https://gerrit.wikimedia.org/r/498362

Base added a subscriber: Base.Mar 22 2019, 3:50 PM

There was such an issue in ptwiki with regards to sitenotice yesterday, when I performed a zero-edit (i.e. opened for editing and clicked save without changes), the banner was gone.

Change 498363 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.22] Only load latest revision in MessageCache::loadFromDB

https://gerrit.wikimedia.org/r/498363

Nirmos added a subscriber: Rxy.Mar 22 2019, 3:51 PM

@Jseddon, @Rxy: Just editing MediaWiki:Sitenotice isn't good enough because it's going through the revisions and picking another one at random. You need to hide it with CSS like this: https://sv.wikipedia.org/w/index.php?diff=45286235

Incrementing the Sitenotice_id seeming to work on ja.wiki

Mentioned in SAL (#wikimedia-operations) [2019-03-22T15:58:07Z] <James_F> UBN hot-deploy for T218918: Only load latest revision in MessageCache::loadFromDB

OK, hot-fix is now emergency-deployed. Theoretically this means this should be fixed for new page impressions. It's probable that some broken pages will be in the caches, however, so logged-out users may well get the wrong experience.

What's the status? Is the problem gone?

abian added a comment.EditedMar 22 2019, 5:27 PM

What's the status? Is the problem gone?

It works for me if I manually purge the corresponding pages in the MediaWiki namespace (what you suggested on T218918#5047858).

daniel added a comment.EditedMar 22 2019, 5:29 PM

It works for me by manually purging the corresponding pages in the MediaWiki namespace (what you suggested on T218918#5047858).

Can you confirm that they do not revert back to a bad state after an hour or so? THis is what was happening before. Purging always helped, but didn't stick.

abian added a comment.Mar 22 2019, 6:23 PM

Still working for me.

Jdforrester-WMF closed this task as Resolved.Mar 22 2019, 6:25 PM
Jdforrester-WMF assigned this task to daniel.

Provisionally declaring fixed, then.

Since no order is specified on the query, which revision will come last is undefined in theory, but in practice, it seems to always be the one with the largest revision ID. This will usually be the latest revision, so the bug doesn't occur in most cases, and wasn't caught in testing.

More specifically, at least on MySQL/MariaDB it usually matches the order of whatever index the DB used for the fetching of the rows. For the broken query here there are several options. Which of the several candidate indexes it picks is likely up to the details of the replica's index statistics, which can vary between replicas and might change as new revisions are inserted or deleted. If it picks rev_page_id or page_timestamp that would result in it using the latest revision by rev_id or rev_timestamp, but there are other options.

Tiven2240 reopened this task as Open.Mar 23 2019, 7:50 AM

These are two live screenshots of Sitenotice that are shown of October 2018. This is Happening from past 2 days and users are getting misguided due to this. Please fix it.

The sites are hi.wikipedia.org and mr.wikipedia.org

Please fix it.

Did they happen after the fix was released into production? (That is, after 2019-03-22 16:00 UTC, or 2019-03-22 19:30 IST.)

Do the problems remain for logged-out users after action=purge?

OK, can you give any URLs where it occurs? I can't replicate that here.

On hiwiki, I found the wrong sitenotice (this one instead of the one) on this article via Special:Random; it had been generated at an old time (2019-03-08T23:50:51, found in the page HTML). After an action=purge, the timestamp was updated (to 2019-03-23T19:19:07). Once I reloaded the page, it showed the correct (current) sitenotice.

Similarly, on mrwiki, I found the wrong sitenotice (this one instead of blank) on this article, generated at 2019-03-21T20:04:03; purging bumped it to 2019-03-23T19:24:10.

Unfortunately it's not possible for us to purge the old cached pages. A quick hack for the sitenotice would be to set .mw-dismissable-notice { display: none } in the site's Common.css until the caches finally expire in thirty days' time.

Change 498692 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Disable flapping MessageCacheTest::testLoadFromDB_fetchLatestRevision()

https://gerrit.wikimedia.org/r/498692

Change 498692 merged by jenkins-bot:
[mediawiki/core@master] Disable flapping MessageCacheTest::testLoadFromDB_fetchLatestRevision()

https://gerrit.wikimedia.org/r/498692

Blocked this on T219042: Flaky test MessageCacheTest::testLoadFromDB_fetchLatestRevision. The regression test I added for the fix seems to "sometimes" fail, indicating that the problem "sometimes" still happens - maybe only in the test, maybe also in production. I'm unable to reproduce.

In any case, this ticket should not be closed until we have a regression test that reliably passes.

Change 498804 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] DNM: Fix MessagecacheTest::testLoadFromDB_fetchLatestRevision

https://gerrit.wikimedia.org/r/498804

A fix is up for the flaky test: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/498804
I screwed up and put an invalid timestamp in, causing it to default to "now".

According to what @Jdforrester-WMF said above, the problem reported by @Tiven2240 is probably due to caching on some level. If there are no new occurrences reported, I suppose this ticket can be closed again.

Change 498804 merged by jenkins-bot:
[mediawiki/core@master] Fix MessagecacheTest::testLoadFromDB_fetchLatestRevision

https://gerrit.wikimedia.org/r/498804

Jdforrester-WMF closed this task as Resolved.Mar 25 2019, 4:16 PM

Declaring this fixed for now.