
Delay in French mobile banners showing up in Banner Allocation
Open, Needs Triage, Public

Description

We put up French and English (France) mobile banners today at 18:00 UTC and noticed that while the English/France mobile banners showed up immediately in Banner Allocation, the French/France mobile banners took at least half an hour. Would Andy have an idea why?

Thanks,
Thea

Event Timeline

@TSkaff Were you able to verify that both banners were showing up on the site at the same time?

@DStrine In both campaigns, I set the launch time to 18:00 UTC. At 18:00 UTC I noticed that banner allocation worked for English but not for French, and I could use statler to pull data for English but got nothing for French.

Thanks!! Checked the data in Turnilo (formerly Pivot), and it shows the problem happened as described: https://bit.ly/2NzT6OW

(There's a problem with the labels on the x-axis of the graph... For correct times of events shown there, hover over the lines and look at the info box next to the cursor. See T197276.)

The English campaign started showing up for users around 18:00, but the French one only appeared around 18:40. The CN logs show both were turned on a bit before 18:00. Also, I don't see anything in the logs for changes in banner settings that might have caused this.

Definitely worth further investigation!!

Just checking stuff to eliminate possible causes: there's nothing odd in the server logs for that time, nor were any actions related to the datacenter switchover set for that day (see T199073).

Also nothing stands out in logstash for that time: search 1, search 2.

Also just re-checked the CentralNotice logs... I don't see any changes in any of them around 18:40 on 2018-10-09, which is when the mobile frFR banner finally went out (according to Druid/Turnilo, see above).

Checked a couple more things:

  • Looked at various dashboards for events around this time, in case there were cluster issues that I might have missed in logstash: general mysql, database lag, resource loader (RL is used to send choice data, that is, data about available campaigns and banners), memcache (used to cache choice data). I don't see any potential explanations there.
  • General CN health: everything else in CentralNotice seems OK around this time. Status codes look normal, and no other campaigns appear to have been disrupted.

Summary

What this isn't:

  • Not a more general MediaWiki or database issue, outage, or any known problem on the cluster.
  • Not a campaign or banner configuration issue.
  • Not a general CentralNotice outage.
  • Not a data pipeline issue.

What this is:

  • At least one campaign began to be selected by clients about 40 minutes late.

There may have been other, as yet undetected, effects of whatever the underlying cause was.

So far, my best guess as to the cause is a bug in ChoiceData object caching. The object cache TTL for ChoiceData is 1 hour. There were also some changes in campaign and banner configuration a little over an hour before the missing campaign finally started to go out. So, maybe an old version of ChoiceData was being served from that cache, and the campaign was only included in ChoiceData once the cached version expired.
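To make the timing of this hypothesis concrete, here is a minimal sketch of a cache with a fixed 1-hour TTL that is not invalidated when a campaign is enabled. It is a toy model, not CentralNotice code: the class and function names are hypothetical, and the 17:40 cache-fill time is an assumed value chosen only to be consistent with "a little over an hour before" 18:40. The 17:54 enable time comes from the settings changes mentioned later in this task.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional

TTL_MINUTES = 60  # ChoiceData object cache TTL stated in this task


@dataclass
class CacheEntry:
    value: FrozenSet[str]
    stored_at: int  # minutes since midnight UTC


class TtlCache:
    """Toy cache: fixed TTL, never invalidated by configuration changes."""

    def __init__(self, ttl: int) -> None:
        self.ttl = ttl
        self.entry: Optional[CacheEntry] = None

    def get(self, now: int, rebuild: Callable[[], FrozenSet[str]]) -> FrozenSet[str]:
        # Rebuild only when nothing is cached or the entry has expired.
        if self.entry is None or now - self.entry.stored_at >= self.ttl:
            self.entry = CacheEntry(rebuild(), now)
        return self.entry.value


def minutes(hhmm: str) -> int:
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def hhmm(total_minutes: int) -> str:
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


def first_served(enabled_at: int, cached_at: int, ttl: int = TTL_MINUTES) -> Optional[int]:
    """First minute at which the campaign appears in the served data.

    Assumes the cache was last filled at `cached_at`, at or before `enabled_at`.
    """
    if cached_at > enabled_at:
        raise ValueError("this sketch assumes the cache was filled before the campaign was enabled")
    campaign = "frFR-mobile"  # hypothetical campaign label

    def active_campaigns(now: int) -> FrozenSet[str]:
        return frozenset({campaign}) if now >= enabled_at else frozenset()

    cache = TtlCache(ttl)
    cache.get(cached_at, lambda: active_campaigns(cached_at))  # initial fill
    for now in range(enabled_at, enabled_at + ttl + 1):
        if campaign in cache.get(now, lambda: active_campaigns(now)):
            return now
    return None


if __name__ == "__main__":
    served = first_served(enabled_at=minutes("17:54"), cached_at=minutes("17:40"))
    print("campaign first served at", hhmm(served))
```

Under these assumed times the campaign first appears when the stale entry expires, one TTL after the cache was filled, regardless of when the campaign was actually enabled.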

Hi! I've found a pretty convincing indication that the problem was with old ChoiceData being sent to browsers.

Looking at HTTP response sizes for the requests that fetch ChoiceData for fr.m.wikipedia.org, we can see that the response size for requests fetching only the ext.centralNotice.choiceData and jquery modules (bundled together) didn't change following the settings changes around 17:54.

So an old ChoiceData was stuck, almost certainly in the object cache. (That would also explain the lack of updated data on the Banner Allocation page.)

response_sizes1.png (508×1 px, 25 KB)
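The kind of check described above can be sketched roughly as follows. This is not the query from the notebook: the function names are hypothetical, and the sample sizes are made-up numbers used only to show the shape of the comparison (median response size before versus after a known settings change).

```python
from statistics import median
from typing import List, Tuple

Sample = Tuple[str, int]  # ("HH:MM" UTC timestamp, response size in bytes)


def median_sizes(samples: List[Sample], change_at: str) -> Tuple[float, float]:
    """Median response size before and after `change_at` ("HH:MM")."""
    # Zero-padded "HH:MM" strings within one day compare correctly as text.
    before = [size for ts, size in samples if ts < change_at]
    after = [size for ts, size in samples if ts >= change_at]
    if not before or not after:
        raise ValueError("need samples on both sides of the change time")
    return median(before), median(after)


def payload_changed(samples: List[Sample], change_at: str, min_delta_bytes: int = 1) -> bool:
    """True if the median size moved by at least `min_delta_bytes`."""
    before, after = median_sizes(samples, change_at)
    return abs(after - before) >= min_delta_bytes


# Made-up samples: the size stays flat across the 17:54 settings change,
# which is what a stale cached payload would look like.
stale = [("17:30", 5120), ("17:45", 5120), ("18:00", 5120), ("18:20", 5120)]

# Made-up samples: the size changes once new campaign data is served.
fresh = [("17:30", 5120), ("17:45", 5120), ("18:00", 5480), ("18:20", 5480)]

if __name__ == "__main__":
    print("stale payload changed:", payload_changed(stale, "17:54"))
    print("fresh payload changed:", payload_changed(fresh, "17:54"))
```

A flat median across the change time is only an indication, since a settings change could in principle leave the payload size unchanged.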

Here's the Jupyter notebook with the queries used:

Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the emails sent to the task assignee on Oct 27 and Nov 23). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)