Page MenuHomePhabricator

CN Campaign Suppression prior to scheduled start time
Closed, ResolvedPublic2 Estimated Story Points

Event Timeline

I tried reproducing this locally as described, but everything seem worked normally. Any more details or thoughts on how to reproduce would be great! We could try to do so on the beta cluster...

Hi... Thanks for noticing this and for the careful description...!!! I'm not sure the cause is what you describe, though.... Looking just at desktop: while drop in wml 2017-gb impressions around 2:30 pm does seem to coincide with the enabling of C1718_en6C_dsk_FR, there are other similar drops in wml 2017-gb impressions that don't. For example, at 3:40 pm, while C1718_en6C_dsk_FR wasn't re-enabled until 4:09 pm.

Also, if you look at the wml 2017-gb campaign in countries not targeted by C1718_en6C_dsk_FR, you get a similar pattern of drops (though it's a much smaller amount of data).

I think we do need to find out what caused the drops in wml 2017-gb impressions, but to me it seems far from clear that they're caused by campaigns enabled before their start time.

Thanks again!!! :)

There were some issues with site reachability in Europe on September 6 and I think at around the right time which might explain some of these drops. It was discussed in wikimedia-operations IRC, but I wasn't able to find an incident report.

@Jseddon hey... I'm surely missing something... Don't see much in the last graph you linked to, above... Can you tell me which campaign suppressed which one in the test you did? Thanks!!!

OK, thanks. So the campaign that caused this is wlm 2017-gb-blank?

Screenshot of log with relevant changes highlighted:

T175358_logs_screenshot.png (796×1 px, 201 KB)

Next steps would be: re-reviewing the JS code that chooses campaigns, and trying to replicate on the beta cluster.

Intriguing that suppression is only partial, especially for the wlm 2017 campaign.

I'd with your suggestion, @Jseddon, that campaigns not be enabled sooner than necessary until this is dealt with...

Thanks again!!

The suppression was only partial because I was only targetting desktop with
the banner.

Though if you filter the graph to show only desktop devices, the suppression of wlm 0217 is still only partial. (Just highlighting this because it might help us figure out what's going on...)

Hi! I've successfully reproduced this twice using campaigns targeting aa.wikibooks, on production. We're getting a funny object in ChoiceData, like an array with an empty slot. However, in that case, the JS code does not iterate over it correctly. So we can assume it suppresses all campaigns targeting the language/project combination with the affected ChoiceData.

Change 379576 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[mediawiki/extensions/CentralNotice@master] Ensure numeric keys without gaps in ChoiceDataProvider::getChoices()

https://gerrit.wikimedia.org/r/379576

K, reproduced locally and smoke tested the fix!!

@Jseddon thanks so much for seeing this, and many apologies for my initial skepticism. You were 100% correct. Amazing catch.

This was due to a subtle bug added with performance improvements, deployed around the middle of 2016.

Technical details

In ChoiceDataProvider::getChoices(), we were caching choiceData with ObjectCache for up to an hour. So the cache included campaigns that might be available within the maximum cache lifetime. Before returning the cache contents, we were filtering out campaigns that had not yet started or had finished. To do so, we used array_filter, which maintains original array keys. This can lead to gaps in array's numbering, in which case json_encode encodes as a JS object, not an array. Then, on the JS side, we were getting a silent error: the code was acting as if there were no choices, because the object's length property was undefined. tl;dr: Aaaargh!!!!

Likely impact

This should have only affected the system during one hour or less before an enabled campaign was scheduled to start, or within one hour after an enabled campaign's end time. However, it would have affected all users in the project-language permutations targeted by such a campaign. (I.e., if a campaign targeted English and German Wikipedia, then all users of those wikis would have been affected, regardless of country, device, etc.)

The exact duration of the outages would have depended on the exact timing of object cache refreshes. In many cases, they most likely lasted less than an hour.

(The case provided by @Jseddon does fulfill these requirements--the log entry that I missed, the details of which are not displayed in the screenshot provided above, was the change in the start time of the suppressing campaign, to less than an hour before the outage began.)

Thanks so much once again, @Jseddon, @Pcoombe and everyone else, for your help with this!!! :D

Change 379576 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@master] Ensure numeric keys without gaps in ChoiceDataProvider::getChoices()

https://gerrit.wikimedia.org/r/379576

Change 381049 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[mediawiki/extensions/CentralNotice@wmf_deploy] Ensure numeric keys without gaps in ChoiceDataProvider::getChoices()

https://gerrit.wikimedia.org/r/381049

Change 381049 abandoned by AndyRussG:
Ensure numeric keys without gaps in ChoiceDataProvider::getChoices()

Reason:
Re-submitting with message about cherry-pick

https://gerrit.wikimedia.org/r/381049

Change 381049 restored by AndyRussG:
Ensure numeric keys without gaps in ChoiceDataProvider::getChoices()

Reason:
Gerrit silliness

https://gerrit.wikimedia.org/r/381049

Change 381049 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@wmf_deploy] Ensure numeric keys without gaps in ChoiceDataProvider::getChoices()

https://gerrit.wikimedia.org/r/381049

Mentioned in SAL (#wikimedia-operations) [2017-09-27T18:46:20Z] <catrope@tin> Finished scap: SWAT: T176762, T175358, T172023, T174719, and add html5-misnesting to Linter (duration: 22m 55s)

AndyRussG set the point value for this task to 2.Oct 23 2017, 1:44 AM

I've attempted to reproduce again on production, and everything seems fine... So, apparently, the patch fixed it (or maybe the fix patched it)... @Jseddon, @Pcoombe, have you seen any indications of this bug since the patch was deployed? Thanks much!!!! :D

Closing, since this seems all good now :)