Page MenuHomePhabricator

Banner not showing up on site
Closed, ResolvedPublic

Description

Hi @DStrine, ran into an unusual and 'serious' bug today where a particular banner does not want to display on Wikipedia.

Here's a preview link: https://en.wikipedia.org/wiki/Guy_Fawkes?banner=B1617_0907_en6C_dsk_p2_sm_rml_mlr&force=1&country=US

49 out of 50 times, it doesn't load - I see this code at the top of the particular Wikipedia article I'm on: ⧼Centralnotice-template-B1617_0907_en6C_dsk_p2_sm_rml_mlr⧽

It did load very occasionally, so I thought it might just be a little quirk. I enabled it as part of our weekly Big English campaign - here's the Results Sheet - but took the campaign down after a couple minutes, because I saw no donations or impressions coming in. Its entry did appear in the Global allocation logs.

This banner is an exact clone of an internal banner that loads every time. Here's that other link.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Here's the chronology of possibly unrelated events:

  • 2016-08-29 - Branch cut, starting the 1.28wmf.18 dev cycle
  • until 2016-09-05 - Many commits (full list of wmf.18 changes)
  • 2016-09-06 - wmf.18 is deployed to group0 wikis (roadmap)
  • 2016-09-07 - wmf.18 is deployed to group1 wikis. meta.wikimedia.org is part of group1
  • 2016-09-07 - this bug (T144952) is reported

I spent a bunch of time on a video call with @awight and @AndyRussG talking through theories, and we put some of our clues out on IRC, trying to get @bd808 and @ori involved, but eventually luring @Krinkle in. ;-) He wrote:

18:55:11 <Krinkle> the 10 min window after invalidation where things work and then they don't, that sounds like WANObjectCache's 'tombstone' and/or 'lockTSE' logic.

@awight also found "NoticeBannerMaxAge": 600 in the CentralNotice configuration, which was another place our speculation took us.

@Krinkle ended up helping describe some of the important challenges after a cache miss. (irc log). I got the sense that @AndyRussG and @awight have a lot better understanding of things to investigate in the CentralNotice code. Thanks @Krinkle!

Hope this helps.

I've been able to trigger this error a few times on the beta cluster by creating a banner, then immediately previewing it on-wiki.

It doesn't seem consistent, though. Maybe this has to do with the exact timing of the preview request.

Also, sometimes, after modifying a banner, I've gotten a preview with the up-to-date version, then reloaded the preview, and it's gone back to an old version. Maybe some randomness in how the beta cluster simulates replica lag? In any case, I'm worried that sometimes we're serving out-of-date versions of banners, but never hearing about it!

You can see the error happen in your browser's Network tab, by looking at the response from Special:BannerLoader. Also you can query the beta cluster's own logstash, like so.

Thanks!!

Change 315460 had a related patch set uploaded (by AndyRussG):
Enable MessageCache debugging on the beta cluster

https://gerrit.wikimedia.org/r/315460

Change 315460 merged by jenkins-bot:
Enable MessageCache debugging on the beta cluster

https://gerrit.wikimedia.org/r/315460

@RobLa-WMF - Thanks for getting involved on this one. Looks like we might need some additional assistance chasing this down before November.

@AndyRussG - What do you need, in order to move forward with addressing this ticket?

@K4-713 Thanks! I'm pretty sure it's somehow related to MessageCache or Revision caching and details of the underlying ObjectCache, so if someone with deep knowledge of those things would like to take a peek, that'd be fantastic!! :)

We just activated some instrumentation on the beta cluster for this. I'm writing notes about what we have so far in this etherpad, and will post a summary here in a bit.

Thanks again!!!

Summary of what seem to be the most important data points:

  • The times I've triggered this on the beta cluster, it's been right after creating a banner.
  • So far, most occurrences of the error that I've verified on production also happen shortly after a banner is created (< 6 min.). (Several remain to be checked.)
  • On the beta cluster, the MessageCache debug log said "Loading en... got from local cache" for at least one request where the error occurred.
  • The error doesn't necessarily happen on the very first request for the banner content following banner creation. (See Hive query for B1617_1011_itIT_dsk_p1_lg_bkg_wfb on the etherpad.)

Change 315712 had a related patch set uploaded (by Ejegg):
Only set memcache hash when saving messages to memcache

https://gerrit.wikimedia.org/r/315712

Following up on the patch ^ by @Ejegg, addressing the issue he found! Looking at the current version of MessageCache::saveToCaches():

	protected function saveToCaches( array $cache, $dest, $code = false ) {
		if ( $dest === 'all' ) {
			$cacheKey = wfMemcKey( 'messages', $code );
			$success = $this->mMemc->set( $cacheKey, $cache );
		} else {
			$success = true;
		}

		$this->setValidationHash( $code, $cache );
		$this->saveToLocalCache( $code, $cache );

		return $success;
	}

So we're setting the validation hash in memcache even when we're only saving to the local cache? Sure doesn't seem right...? This is not code that has changed on a timeline at all close to the appearance of this bug... The only place where this method is called with a value for $dest other than 'all' is here, in MessageCache::load(), when we save to the local cache after having loaded from the global cache. That code hasn't changed in a while, either.

What would be expected to happen if a host writes the wrong message cache hash to memcache? If there are other hosts that do have an up-to-date local cache (as seems to be the case in these errors) wouldn't the hash in memcache get fixed pretty quickly? Also, have there been any recent changes in the underlying systems (ObjectCache and related) that could have jiggled how this operates on production?

Earlier today, @Ejegg and @XenoRyet deployed the MessageCache patch to the beta cluster to test. While it was there, we were unable to reproduce this bug. However... afterwards, when the beta cluster went back to current master, we were still unable to reproduce the error as we had a few days ago!

As of this writing, the most recent occurrence of the error on the beta cluster was 2016-10-12 17:12:40. I don't see any changes in core or in mediawiki-config between then and now that might have impacted on this.

This is still happening on production, however. Most recent occurrence there was 2016-10-13 20:34:33.

In order to understand if rMWaa5be0c1e03d is a safe change, I'd need to better understand the problem @aaron was trying to fix with rMW2be60e777a09, which introduced the setValidationHash call that @Ejegg is changing. Understanding that myself is a little out of my depth, but I'm hoping y'all understand this given the proposed modifications.

Still not able to reproduce on the beta cluster anymore.

FWIW here are links to patches that have merged recently:

Here is a general theory:

  • There is a bug in MessageCache that causes wrong data to be placed in localCache and/or wanCache and remain there longer than it should.
  • The problem never or rarely occurred until recently.
  • Shortly before this bug was reported, a lot of work was done on caching. One or more of those changes jiggled things to make the bug happen, or happen more often.
  • The bug still only happens under certain load conditions. That's why reproducing it on the beta cluster is a flaky affair.

Just a note about the underlying ObjectCache implementations we're dealing with: on production, MessageCache's localCache is an instance of APCBagOStuff, and wanCache is a sweet WANObjectCache.

@aaron knows that code quite well. It might be worth asking him for advice.

Change 315712 merged by jenkins-bot:
Only set memcache hash when saving messages to memcache

https://gerrit.wikimedia.org/r/315712

Change 316204 had a related patch set uploaded (by AndyRussG):
Make $wgMessageCacheType on beta cluster the same as on production

https://gerrit.wikimedia.org/r/316204

@aaron thanks for merging the patch!!

I'm kinda afraid that it wouldn't impact much on this bug, since the code should only ever call saveToCaches() with 'local-only' when data has been loaded from the global cache, and the WAN cache key is not volatile/expired. So the unnecessary setting of the WAN cache key would cause some extra overhead, but the value written should still have been correct. Does that make sense?

Maybe this is it? Or at least something?

$wgMaxMsgCacheEntrySize is 1024. I'm sure that's smaller than most banners—definitely smaller than fundraising banners.

So, most banner content isn't cached with the main structure of messages-by-language. In that structure, the message value [[ https://github.com/wikimedia/mediawiki/blob/3cd76445c176c57915c6de2db1d8f32d760db846/includes/cache/MessageCache.php#L482-L485 | is just set to the string !TOO BIG ]].

However, only that structure, and not the content of larger messages, is taken into account when making the validation hash.

That means that a change to a large message will result in the re-writing of the hash with the same value as before, so hosts will continue to use their local caches.

As far as I (sort of) understand, this could lead to hosts serving stale banner content until a change to a small Mediawiki namespace page (a very small banner or otherwise) jiggles the validation hash, or hosts' local caches expire (at most, $wgMsgCacheExpiry, 24 hours).

However, I don't see how this would cause missing messages, since the validation hash should at least change when the message is first created. Also doesn't explain why this bug tended to appear shortly after banners were created, nor why we started getting it when we did (first week of September).

Thanks again!!!

Aaarg please scratch the preceding comment. (Messages marked as !TOO BIG in the local cache data structure are only stored in and retrieved from the WAN cache. So the validation hash value only serves to decide whether to get stuff from the local cache, which never happens anyway in these cases anyway... I think...)

Please ignore this earlier comment, too. Calling touchCheckKey() also expires the validation hash, which should also cause hosts refresh the local cache.

Here's a patch with inline comments that might clarify some stuff that was confusing me...

https://gerrit.wikimedia.org/r/316368

Change 316383 had a related patch set uploaded (by AndyRussG):
MessageCache: Use checkKeys for large messages

https://gerrit.wikimedia.org/r/316383

^ Maybe that's something? checkKeys wasn't being used when fetching large messages.

Change 316394 had a related patch set uploaded (by Ejegg):
getMsgFromNamespace uses Revision::READ_LATEST

https://gerrit.wikimedia.org/r/316394

Change 316394 abandoned by Ejegg:
getMsgFromNamespace uses Revision::READ_LATEST

Reason:
Cool, let's go with the other staleness fix

https://gerrit.wikimedia.org/r/316394

Another option:

In CentralNotice's BannerRenderer, instead of retrieving banner content as an i18n message, we could get it as a revision of the corresponding page in the Mediawiki namespace, similar to what MessageCache does.

Revision does its own memcaching of serialized content. However, there would inevitably be at least one DB query, and Content::getWikitextForTransclusion() (or something equivalent?) would be called.

This would happen every time Special:BannerLoader hits PHP. A request to that page is made for every banner loaded, but it doesn't get to PHP that often because Varnish (used regardless of logged-in status).

(BTW, this approach was suggested before by @awight and @Ejegg... I'd originally thought it too sweeping, but, taking another look, maybe it's the right way to go.)

Disadvantages:

  • Some sort of performance hit, though I don't know how severe.
  • Untested and big FR campaigns are starting. We'd have to test at scale before December.

Advantages:

  • Less complexity/number of layers of caching for banner content.
  • Reduction in CN's reliance on overloading a system (i18n) that was not designed for timely, mission-critical targeted messages (banners).
  • Freedom to tweak and improve message caching just for the i18n use case, since we wouldn't have to worry about such overloading.
  • Greater certainty that the expected version of a banner is displayed. (This is important for short A/B tests.)

@aaron, @Krinkle, @ori Would it be possible to please get your opinion on this option?

P.S. I guess it's pretty clear that eventually we need a caching mechanism specifically for banner content...

Many thanks!!! :)

I'm not sure pulling the revision will reduce complexity. If we do it that
way, we have the joy of invoking the MediaWiki parser directly...

Another option, I think, would be to add to CN some lightweight monitoring of the banner version served. We could, for example, create a verification hash on the server when a banner is saved, tack it onto banner content, and also put it in a CN table. Then we could send the hash to the client with ChoiceData, and verify it on the client when the banner is received from BannerLoader. At least that way we'd be able to tell if the number of incorrect banner versions served exceeds some reasonable threshold...

(It seems normal that such an intricate system, made to squeeze the most juice out of our limited hardware, should serve a certain number of old banners, and that that number might grow unexpectedly under certain conditions. So... please don't view this idea at all as a criticism of all the meticulous work done on caches... :) )

I'm not sure pulling the revision will reduce complexity. If we do it that
way, we have the joy of invoking the MediaWiki parser directly...

Couldn't we just do the same thing as MessageCache?

$message = $content->getWikitextForTransclusion();

I can imagine a lot goes on under the hood there, but at least for the calling code, it seems painless-ish...? (I did expect worse... before I saw that line...)

@aaron, thanks for improving the checkKeys patch!!!!

I have some minor questions that I'll put in Gerrit... The main thing I'd like to check, please, is that the following possible race condition is taken into account:

  • A banner is updated via CN admin UI.
  • A new revision is saved in the DB (from WikiPage::doEditContent()).
  • The MessageCache checkKey is touched for $wgLanguageCode (en on meta, I'm pretty sure) (WikiPage::doEditUpdates() > MessageCache::replace()).
  • The checkKey purge happens at some DC before the replica DBs catch up.
  • A request for that banner arrives at that DC in that language exactly during that window.
  • The host at that DC asks for the message from MessageCache, which retrieves, from the replica (or maybe from Revision's own cache) the old message (or, for a new banner, maybe the lack thereof), and stores that in the DC's memcache.
  • Since the checkKey touch happened earlier, the message is not expired and sticks around for the duration of MessageCache::mExpiry (24 hours).

Is this taken into account by the $cacheOpts param you added to the calls to WanObjectCache::set()? Or by tombstoning built into WanObjectCache::touchCheckKey()? Or maybe both?

Thanks once again!!!! :D

touchCheckKey() makes a key be seen as invalid if it was created before the current time plus the HOLDOFF window (11 seconds or so). So 11 seconds of DB lag are fine, since it will get refreshed at some point. If there are > 11 seconds of lag, the $cacheOpts fields in the PS2 version of your patch will shorten the expiry to a matter of seconds, and the key will fix itself eventually (around the time the slaves catch up rather than a full 24 hours).

@aaron thanks for the explanation! Just to check, I see the same sort of options are used in the callback in the call to getWithSetCallback() in Revision::newKnownCurrent(). So, if DB lag is greater than 11 seconds, the revision stored in the revision cache will have a very short expiry as well, right?

When lag is > 7 seconds, the TTL cap is TTL_LAGGED, which is 30 seconds.

Ah K thanks... Mmmm just to follow up, to check that I'm understanding (sorry to go on about this...) so, $cacheOpts can't know anything about the the freshness of the specific data being queried--rather it just has overall measurements of lagginess. So it's mainly useful in combination with the HOLDOFF window, which pushes the old key's expiry time into the future, and $cacheOpts ensures we only keep cache data for any significant length of time if lag is far enough below the HOLDOFF threshold. Is that it?

So, if the replicas happen to stay very laggy, key/values will keep getting cached for a maximum of 30 seconds, I guess?

(I was also wondering if there are unusual scenarios possible due to the interactions of two memcache stores, one (MessageCache) getting filled up by data from another (Revision cache)... But it seems not, since Revision cache values are held indefinitely and keys contain revision Ids, no? So there we're always guaranteed the same freshness as is available directly from the replicas, and no retrieval of old values due to a key's expiry being pushed into the future?)

Change 316383 merged by jenkins-bot:
MessageCache: Use checkKeys for large messages

https://gerrit.wikimedia.org/r/316383

Quoting @aaron's response on IRC about the proposal to switch CN BannerRenderer to grab revisions directly:

17:53 <AaronSchulz> in any case, I'd lean toward just doing the core fix, I'm not sure there is a huge amount to gain by a separate getWithSetcallback call.
17:53 <AaronSchulz> though it wouldn't be crazy either.
17:53 <AndyRussG> K :)
17:54 <AaronSchulz> core messages are used for lots of stuff that is important and customized, so it should just "work". [...]

Change 317074 had a related patch set uploaded (by AndyRussG):
MessageCache: Use checkKeys for large messages

https://gerrit.wikimedia.org/r/317074

Change 317074 merged by jenkins-bot:
MessageCache: Use checkKeys for large messages

https://gerrit.wikimedia.org/r/317074

Mentioned in SAL (#wikimedia-operations) [2016-10-20T23:45:05Z] <dereckson@mira> Synchronized php-1.28.0-wmf.22/includes/cache/MessageCache.php: Use checkKeys for large messages (T144952) (duration: 00m 50s)

The changes to core MessageCache were just deployed. If we don't see any new occurrences of this for a few days, I think we could close... :) Thx all!!!!

We are still seeing issues with banner i18n messages missing and banners not being shown. Noticed by @spatton... (thx!!) In the logs (try searching logstash for ' "message key centralnotice" ', in quotes) we have B1617_1024_itIT_dsk_p2_sm_nfo_flg not being found 10 times, over a period of about 8 minutes, from 02:05:18 to 02:13:25, on 2016-10-25. (Note these are just the times Special:BannerLoader hit PHP. The empty response was probably served a lot more by Varnish.)

In itself, this isn't tragic, however...

  • I'm worried about having a bug that we don't understand causing banners to not show up unpredictably, especially going into the December fundraising campaigns. I wouldn't want this to flare up as campaigns scale.
  • I'm pretty sure this is quite inconvenient for folks who are creating banners/setting up campaigns.

I was thinking of checking out details of Hive requests and CN logs for this banner... Any suggestions would be much appreciated! Thanks!!

Are these just-created messages or old ones?

Change 318489 had a related patch set uploaded (by Aaron Schulz):
[WIP] Avoid races in MessageCache::replace()

https://gerrit.wikimedia.org/r/318489

What are all of the layers of cache *above* MessageCache (e.g. CN cache logic and varnish)?

What are all of the layers of cache *above* MessageCache (e.g. CN cache logic and varnish)?

In the case of banner loading, it's just Varnish. Cache TTL should be 10 minutes for logged-out users, and is set in PHP, so it's hit-for-pass. (BTW, this setup almost certainly could use some improvement.)

Server debug logs from requests hitting PHP seem to show that we repeatedly get missing messages via MessageCache. (As noted above: LogStash for ' "message key centralnotice" ', in quotes.)

Also, looking at the webrequest logs related to B1617_1024_itIT_dsk_p2_sm_nfo_flg, I found that the creation and edit of the banner and the failed attempts to retrieve it went to the same DC. So, the issue doesn't seem to be related to relaying info between DCs.

I don't see any code to purge the corresponding banner pages from CDN whenever their message is created/edited? Is that not present? If it were, there would be an initial purge and rebound purge 11 seconds later to cover lag. To cover 11+ seconds of lag, MediaWiki lowers the CDN TTL to $wgCdnMaxageLagged when lag is 6+ seconds high. However, I see raw header() calls in the CN codebase...not sure how that interacts with this logic in MediaWiki::preOutputCommit.

I don't see any code to purge the corresponding banner pages from CDN whenever their message is created/edited? Is that not present? If it were, there would be an initial purge and rebound purge 11 seconds later to cover lag. To cover 11+ seconds of lag, MediaWiki lowers the CDN TTL to $wgCdnMaxageLagged when lag is 6+ seconds high. However, I see raw header() calls in the CN codebase...not sure how that interacts with this logic in MediaWiki::preOutputCommit.

Yeah, CN doesn't do any purge, we just wait out the cache time of 10 minutes (when it's set). I think this part of the setup should definitely be reviewed, also taking into account whatever vlc code might come into play (no idea if there's anything significant there).

Here's separate task for this important topic: T149873.

Change 316204 abandoned by Reedy:
Make $wgMessageCacheType on beta cluster the same as on production

Reason:
Not needed now, seems to have been removed

https://gerrit.wikimedia.org/r/316204

Change 318489 merged by jenkins-bot:
Avoid races in MessageCache::replace()

https://gerrit.wikimedia.org/r/318489

Looking forward to the deployment of https://gerrit.wikimedia.org/r/#/c/318489/ ! Thanks much!!!

(Just to note, the status of this seems the same as before, with the most recent occurrence on January 25th, as before, for minutes immediately after banner creation.)

From looking at logstash, it seems this is basically fixed!

Since deploy of https://gerrit.wikimedia.org/r/#/c/318489/, there are almost no missing banner message messages in the logs. None that coincide with the creation of a banner, which was the principal situation in which the problem had been happening. Only one occurrence today, but it doesn't coincide with banner creation (so, likely, some other edge case, and of much less concern).

Let's keep an eye on the log and close the bug if the trend continues!

Thanks!!!!

In the past week, we have only one case of a banner not showing up right when it was created. There are, however, four unexplained cases of a "Banner message key ... could not be found in ..." message in the log over that time period.

Since that's quite a different problem than what this bug was originally about, we'll close this bug as resolved, and create a new one about these occasional messages. We can then schedule and prioritize that new task appropriately.

Thanks so much, all!!!!!!! :D