Page MenuHomePhabricator

Banners not allocated correctly
Closed, ResolvedPublic2 Estimated Story Points

Description

We just experienced a weird bug with CentralNotice. This campaign was enabled and within its timeframe, and on the main campaigns list it was highlighted in green as active. Yet its banners were not showing to users and not listed on Special:BannerAllocation.

This persisted from about 1300 (when the campaign should have started) until 1334 UTC. Disabling and re-enabling the campaign had no effect, nor did removing and re-adding all the banners. What finally got it working was changing the priority from "high" to "emergency". Then on reducing it back down to "high" it continued to work as expected.

I can't recall this ever happening before.

Event Timeline

DStrine lowered the priority of this task from Unbreak Now! to High.Aug 31 2016, 4:18 PM
DStrine moved this task from Triage to Sprint +1 on the Fundraising-Backlog board.

This doesn't look good. :( My first guess is a DB replication issue. Let's try to look at the campaign settings in the logs, and investigate if there were issues with DBs on the cluster...

Just spoke to @jcrespo (thanks!!!!) on IRC about possible DB replication lag on the cluster between 13:00 and 13:40 UTC today:

11:51 <jynus> AndyRussG, I do not see anything relevant; however, unless we have a major outage, replication lag should have nothing to do with those errors
11:52 <jynus> when there is lag, a server gets out of the load balancer
11:52 <jynus> and at most, you could get like issues for 10 seconds; not 30 minutes

K, I just inspected the CentralNotice campaign log tables, and I don't see anything out of the ordinary. I guess that eliminates a UI issue as the culprit.

Also, since the campaign was not showing up both in Special:BannerAllocation and in users' browsers, we can rule out a ResourceLoader caching issue, I think.

@Pcoombe, @spatton, when you opened Special:BannerAllocation, did you see any other campaigns that might have been incorrectly blotting out the FR campaign?

Here are some steps that I'd ask you to please take, if possible, if this happens again:

  • Ping on IRC.
  • On a language/project combination where a campaign is expected but doesn't show up, open developer tools, open the console, execute JSON.stringify(mw.centralNotice.choiceData), then copy and save the full output somewhere.
  • Save as a file the contents of Special:BannerAllocation with the wrong data.

Thanks so much!!! Hopefully we can catch this before anything more bad happens...

P.S. Pls don't hesitate to mention any other ideas you have about stuff to try!!!!

@AndyRussG There was the low priority campaign C1617_enWW_dsk_lw_FR running, and showing in BannerAllocation.

@Pcoombe thanks!

Also: the mobile campaigns C1617_en6C_mob_FR and C1617_en6C_ipd_FR that were turned on for the same project/language/countries at the same time worked fine from the start?

My current best guess for the culprit is some problem with the recently added caching of choices in the MediaWiki Object Cache. See ChoiceDataProvider::getChoices(). Maybe this was a situation in which we don't evict stale data properly, or some temporary issue with the object cache itself. I'll ask around about that.

Apparently no memcached infrastructure issues coincide with the timing of this bug, at least not as far as ganglia reports:

(I don't know enuf about our setup to be sure about the relevance of these graphs.)

Maybe an old memcache cache object wasn't purged? Since the changes made it into the logs, it doesn't seem possible that ChoiceDataProvider::invalidateCache() wasn't called...

@AndyRussG correct, C1617_en6C_mob_FR and C1617_en6C_ipd_FR both worked fine from the start.

Thanks for investigating this.

@spatton cool, thanks!

I'm wondering if maybe the memcached object was purged at one datacenter and not the other... But if that were the case, I think Special:BannerAllocation should have shown the campaign as active...? Heard on IRC that @Joe may know the details of our memcached infrastructure... He'll be back from vacation on Monday...

Another theory!! Maybe it's a race condition that works like this:

  • the call to admin interface updates the master DB and invalidates the memcached object;
  • there is an attempt to fetch choiceData shortly after the master DB update;
  • the callback that sets the memcached object queries a DB slave that has not yet caught up with DB master;
  • old choiceData gets stuck in memcached and is served until the memcached object is invalidated again.

That might jive with the CN logs. In that case, it could be that the jiggling of on-off setting that @Pcoombe did at 13:24 is what really reset the cache, but the new choiceData was only visible to browsers when the ResourceLoader cache turned over a little later?

...also it could be that the same race condition kept occurring until the last settings change, when the priority was put back to "high"...

From IRC: seems this is a common ObjectCache issue and there are options we need to pass in to invoke code that deals with it :)

AndyRussG set the point value for this task to 2.Sep 2 2016, 3:56 AM

Change 308126 had a related patch set uploaded (by AndyRussG):
Account for DB lag when refreshing cached ChoiceData

https://gerrit.wikimedia.org/r/308126

Change 308126 merged by jenkins-bot:
Account for DB lag when refreshing cached ChoiceData

https://gerrit.wikimedia.org/r/308126

I'm moving this to the Done column but not closing. It's hopefully done and not related to T145446 or T144952, but you never know....