Banners not allocated correctly
Closed, ResolvedPublic2 Estimated Story Points
Actions

Description

We just experienced a weird bug with CentralNotice. This campaign was enabled and within its timeframe, and on the main campaigns list it was highlighted in green as active. Yet its banners were not showing to users and not listed on Special:BannerAllocation.

This persisted from about 1300 (when the campaign should have started) until 1334 UTC. Disabling and re-enabling the campaign had no effect, nor did removing and re-adding all the banners. What finally got it working was changing the priority from "high" to "emergency". Then on reducing it back down to "high" it continued to work as expected.

I can't recall this ever happening before.

Details

	Subject	Repo	Branch	Lines +/-
	Account for DB lag when refreshing cached ChoiceData	mediawiki/extensions/CentralNotice	master	+21 -4

Customize query in gerrit

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		None	T145719 [Tracking] CentralNotice: Unexplained banner display failures
		Resolved		AndyRussG	T144393 Banners not allocated correctly

Event Timeline

Pcoombe created this task.Aug 31 2016, 2:05 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 31 2016, 2:05 PM

• DStrine triaged this task as Unbreak Now! priority.Aug 31 2016, 4:15 PM

• DStrine added projects: Unplanned-Sprint-Work, Fundraising Sprint Pretending This Isn't Happening.

Restricted Application added subscribers: Jay8g, Luke081515, TerraCodes. · View Herald TranscriptAug 31 2016, 4:15 PM

• DStrine lowered the priority of this task from Unbreak Now! to High.Aug 31 2016, 4:18 PM

• DStrine removed projects: Fundraising Sprint Pretending This Isn't Happening, Unplanned-Sprint-Work.

• DStrine moved this task from Triage to Sprint +1 on the Fundraising-Backlog board.

This doesn't look good. :( My first guess is a DB replication issue. Let's try to look at the campaign settings in the logs, and investigate if there were issues with DBs on the cluster...

Just spoke to @jcrespo (thanks!!!!) on IRC about possible DB replication lag on the cluster between 13:00 and 13:40 UTC today:

11:51 <jynus> AndyRussG, I do not see anything relevant; however, unless we have a major outage, replication lag should have nothing to do with those errors
11:52 <jynus> when there is lag, a server gets out of the load balancer
11:52 <jynus> and at most, you could get like issues for 10 seconds; not 30 minutes

K, I just inspected the CentralNotice campaign log tables, and I don't see anything out of the ordinary. I guess that eliminates a UI issue as the culprit.

Also, since the campaign was not showing up both in Special:BannerAllocation and in users' browsers, we can rule out a ResourceLoader caching issue, I think.

@Pcoombe, @spatton, when you opened Special:BannerAllocation, did you see any other campaigns that might have been incorrectly blotting out the FR campaign?

Here are some steps that I'd ask you to please take, if possible, if this happens again:

Ping on IRC.
On a language/project combination where a campaign is expected but doesn't show up, open developer tools, open the console, execute JSON.stringify(mw.centralNotice.choiceData), then copy and save the full output somewhere.
Save as a file the contents of Special:BannerAllocation with the wrong data.

Thanks so much!!! Hopefully we can catch this before anything more bad happens...

P.S. Pls don't hesitate to mention any other ideas you have about stuff to try!!!!

@AndyRussG There was the low priority campaign C1617_enWW_dsk_lw_FR running, and showing in BannerAllocation.

@Pcoombe thanks!

Also: the mobile campaigns C1617_en6C_mob_FR and C1617_en6C_ipd_FR that were turned on for the same project/language/countries at the same time worked fine from the start?

My current best guess for the culprit is some problem with the recently added caching of choices in the MediaWiki Object Cache. See ChoiceDataProvider::getChoices(). Maybe this was a situation in which we don't evict stale data properly, or some temporary issue with the object cache itself. I'll ask around about that.

Apparently no memcached infrastructure issues coincide with the timing of this bug, at least not as far as ganglia reports:

eqiad
codfw

(I don't know enuf about our setup to be sure about the relevance of these graphs.)

Maybe an old memcache cache object wasn't purged? Since the changes made it into the logs, it doesn't seem possible that ChoiceDataProvider::invalidateCache() wasn't called...

@AndyRussG correct, C1617_en6C_mob_FR and C1617_en6C_ipd_FR both worked fine from the start.

Thanks for investigating this.

@spatton cool, thanks!

I'm wondering if maybe the memcached object was purged at one datacenter and not the other... But if that were the case, I think Special:BannerAllocation should have shown the campaign as active...? Heard on IRC that @Joe may know the details of our memcached infrastructure... He'll be back from vacation on Monday...

Another theory!! Maybe it's a race condition that works like this:

the call to admin interface updates the master DB and invalidates the memcached object;
there is an attempt to fetch choiceData shortly after the master DB update;
the callback that sets the memcached object queries a DB slave that has not yet caught up with DB master;
old choiceData gets stuck in memcached and is served until the memcached object is invalidated again.

That might jive with the CN logs. In that case, it could be that the jiggling of on-off setting that @Pcoombe did at 13:24 is what really reset the cache, but the new choiceData was only visible to browsers when the ResourceLoader cache turned over a little later?

...also it could be that the same race condition kept occurring until the last settings change, when the priority was put back to "high"...

From IRC: seems this is a common ObjectCache issue and there are options we need to pass in to invoke code that deals with it :)

• DStrine moved this task from Sprint +1 to Current Sprint & Completed in Q1 1617 on the Fundraising-Backlog board.Sep 1 2016, 8:03 PM

• DStrine added projects: Unplanned-Sprint-Work, Fundraising Sprint Qwerty Thwacking.

AndyRussG set the point value for this task to 2.Sep 2 2016, 3:56 AM

Change 308126 had a related patch set uploaded (by AndyRussG):
Account for DB lag when refreshing cached ChoiceData

https://gerrit.wikimedia.org/r/308126

gerritbot added a project: Patch-For-Review.Sep 2 2016, 4:09 AM

AndyRussG claimed this task.Sep 2 2016, 4:09 AM

AndyRussG moved this task from Backlog to Review on the Fundraising Sprint Qwerty Thwacking board.

Change 308126 merged by jenkins-bot:
Account for DB lag when refreshing cached ChoiceData

https://gerrit.wikimedia.org/r/308126

AndyRussG moved this task from Review to Pending Deployment on the Fundraising Sprint Qwerty Thwacking board.Sep 8 2016, 7:42 PM

AndyRussG mentioned this in T145446: Impressions slow to update on jaJP dsk campaign.Sep 13 2016, 12:15 AM

I'm moving this to the Done column but not closing. It's hopefully done and not related to T145446 or T144952, but you never know....

AndyRussG moved this task from Pending Deployment to Done on the Fundraising Sprint Qwerty Thwacking board.Sep 13 2016, 12:18 AM

• DStrine added a project: Fundraising Sprint Rocket Surgery 2016.Sep 13 2016, 11:02 PM

Ejegg moved this task from Backlog to Done on the Fundraising Sprint Rocket Surgery 2016 board.Sep 14 2016, 12:45 AM

AndyRussG added a parent task: T145719: [Tracking] CentralNotice: Unexplained banner display failures.Sep 15 2016, 3:16 AM

• DStrine closed this task as Resolved.Sep 28 2016, 10:40 PM

• DStrine removed a project: Unplanned-Sprint-Work.Oct 12 2016, 11:43 PM

Banners not allocated correctlyClosed, ResolvedPublic2 Estimated Story PointsActions

Description

Details

Related ObjectsSearch...

Event Timeline

Banners not allocated correctly
Closed, ResolvedPublic2 Estimated Story Points
Actions

Related Objects
Search...