Special:BannerRandom causing lots of fatal error messages
Closed, ResolvedPublic2 Estimated Story PointsPRODUCTION ERROR
Actions

Assigned To

Authored By

	• demon
	Feb 11 2015, 4:24 PM

Description

Spotted in production:

   478 Undefined property: SpecialBannerRandom::$allocContext in /srv/mediawiki/php-1.25wmf16/extensions/CentralNotice/special/SpecialBannerRandom.php on line 27
   478 error: Argument 1 passed to BannerChooser::__construct() must be an instance of AllocationContext, null given in /srv/mediawiki/php-1.25wmf16/extensions/CentralNotice/includes/BannerChooser.php on
line 41

Details

Subject	Repo	Branch	Lines +/-
Fix detail in warn in a special log on unexpected code execution	mediawiki/extensions/CentralNotice	master	+4 -4
Warn in a special log on unexpected code execution	mediawiki/extensions/CentralNotice	master	+9 -7
Deprecate Special:BannerRandom and return a 410	mediawiki/extensions/CentralNotice	master	+9 -0

Customize query in gerrit

Related Objects

Mentioned In: T230991: Special:BannerRandom returns error 410
T88276: [1 day] BUG: CN banner being shown after campaign end
rMEXTed83775498cf: Updated mediawiki/extensions Project: mediawiki/extensions/CentralNotice…
rMEXTc34baa23df9b: Updated mediawiki/extensions Project: mediawiki/extensions/CentralNotice…
rMEXTc98ce06e8642: Updated mediawiki/extensions Project: mediawiki/extensions/CentralNotice…
T89918: OOM reported at SpecialPage.php:534 due to large output from Special:Book
rECNOea51b75ac65f: Fix detail in warn in a special log on unexpected code execution
rECNO8cc51eb130ff: Warn in a special log on unexpected code execution
rECNO255519b21d2c: Deprecate Special:BannerRandom and return a 410
T89292: Create project/tag to collect fatal/exception log related bugs
T89049: Make log responsibilities changes
Mentioned Here: P289 banner random stacktrace

Event Timeline

• demon created this task.Feb 11 2015, 4:24 PM

• demon raised the priority of this task from to Needs Triage.

• demon updated the task description. (Show Details)

• demon added a project: MediaWiki-extensions-CentralNotice.

• demon subscribed.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 11 2015, 4:24 PM

greg mentioned this in T89049: Make log responsibilities changes.Feb 11 2015, 9:44 PM

greg mentioned this in T89292: Create project/tag to collect fatal/exception log related bugs.Feb 11 2015, 9:52 PM

This is the single noisiest item in the hhvm log right now. Please fix as it drowns out all other information.

Krenair subscribed.Feb 12 2015, 4:15 PM

AndyRussG subscribed.Feb 12 2015, 4:22 PM

An inaugural member of the Wikimedia-production-error project :)

Associated patches:
https://gerrit.wikimedia.org/r/190239
https://gerrit.wikimedia.org/r/#/c/190241/

awight renamed this task from Bad allocation context for BannerChooser to Special:BannerRandom causing lots of fatal error messages.Feb 12 2015, 9:12 PM

awight set Security to None.

In both varnish-ncsa and varnish-kafka based logs, we only see ~30K/day requests to Special:BannerRandom.

Reading the text cluster's varnish setup, it looks like there should be 1 retry per 503. Hence each of the above ~30K/day requests should account for 4 lines in hhvm.log (2xNotice, 2xFatal).

For all the 20-ish varnish-kafka based log lines that I checked, I could find those expected 4 lines in hhvm.log.

However, hhvm.log also has SpecialBannerRandom Notices/BannerChooser Failures that seem to lack a corresponding varnish-kafka log line.

For example on ~2015-02-12T07:13:06 only one Special:BannerRandom request is in the the varnish-kafka logs (with a few seconds buffer before/after without such a request). But the hhvm.log has 32 corresponding lines. That's much more than the 4 expected lines.

We store the varnish-kafka logs unsampled, so it cannot be sampling.
And the varnish-kafka logs seem flawless for the above timestamp.

Are there under some circumstances more than 1 retry for 503s on text? (That might help to explain the mismatch in volume between the number of varnish-kafka log lines and the number of lines in hhvm.log)

Can it be that some services try to fetch Special:BannerRandom directly from the app-servers (or whatever is serving the Special:BannerRandom requests)? (Thereby by-passing the caches and the logging through varnish-kafka)

(Lacking access to the app-servers, I cannot respond to those two questions)

Thanks a ton, @QChris!! A few more notes:

The code for Special:BannerRandom was thought to be essentially dead and is fully deprecated on WMF production. This happened in November, 2014.
An update to CentralNotice was deployed on Tuesday. This is when the error started happening. We understand why that code doesn't work, but we don't know why it's being called. In theory, it should only get called when there' s a request to Special:BannerRandom.
Request logs don't show an increase in incoming requests to Special:BannerRandom at the time of the recent CentralNotice update.
Today, some supposedly unused remnants of configurations related to Special:BannerRandom were removed from mediawiki-config, apparently with no effect.
Since it's been such a long time since calls to Special:BannerRandom were in theory turned off, it seems strange to still be getting all these calls to it (the ~30K/day that we know are coming in). This strangeness is in addition to the strangeness of all the errors in the hhvm log, which seem to be a lot more than the ~30k/day can account for (see above).
There's no clear pattern in the devices sending the Special:BannerRandom requests (other than maybe a high number of mobile devices).
We have only one stack trace for these hhvm log entries, which does show a request to Special:BannerRandom as the cause. The URL has the same old params that used to get sent in, so it does look like it could come from some old cached Javascript that just won't go away.
The referrers seem to be normal wiki pages, which would also be expected for clients that are somehow still running with the old JS that did use Special:BannerRandom.
@awight made a very nice patch to fully strip out Special:BannerRandom from CentralNotice. This is feasible, but I'm a wee bit scared. What if whatever is calling this broken code then breaks worse and breaks other stuff? Or what if this is a symptom of some unexpected behavior that needs to be uncovered, and removing the code for Special:BannerRandom just masks it? I confess I'd feel happier and cozier getting to the bottom of this before deploying such a change.

Thanks much!! :)

AndyRussG added subscribers: K4-713, Ejegg.Feb 13 2015, 4:17 AM

There are two mysteries, right?

why is this page being called?
why are there significantly more hhvm error logs than there are webrequest logs for this page?

Mystery 1 could be explained away by cached client JS, but maybe not too. Mystery 2 is a little more interesting, but, since this is dead code, my opinion is that it isn’t worth the (possibly large) effort in solving it.

If someone else disagrees, I think Andy needs more stacktraces from the error logs to look at the code path that is calling this page. Maybe some of them are directly hitting this as a request, but maybe some others are coming from some other less direct code path. I'm not sure how to get him more stacktraces, so if someone wants to help, that's the first thing to do.

If no one really cares so much, then let's just have them deploy the change to remove the code, eh?

In T89258#1036734, @Ottomata wrote:

There are two mysteries, right?

why is this page being called?

why are there significantly more hhvm error logs than there are webrequest logs for this page?

Mystery 1 could be explained away by cached client JS, but maybe not too. Mystery 2 is a little more interesting, but, since this is dead code, my opinion is that it isn’t worth the (possibly large) effort in solving it.

Agreed. If it's already code on its way out, let's just finish making it go away. I really don't think it's that interesting of a mystery to keep looking into.

If someone else disagrees, I think Andy needs more stacktraces from the error logs to look at the code path that is calling this page. Maybe some of them are directly hitting this as a request, but maybe some others are coming from some other less direct code path. I'm not sure how to get him more stacktraces, so if someone wants to help, that's the first thing to do.

I don't know what more we need. P289 yesterday showed us it was coming from the expected code path. The description here shows what breaks when that does ($allocContext isn't set in BannerRandom since the refactor, ends up being passed null to BannerChooser). The fix is clearly:

Remove dead code, or
Fix SpecialBannerRandom to properly set $allocContext before passing it.

So yes, maybe as an academic exercise it would be nice to understand why it's being called, but I don't think it's worth leaving broken in prod until we do.

If no one really cares so much, then let's just have them deploy the change to remove the code, eh?

Yes, please.

Quick update here: we're preparing a simple patch to just return a 410 from the Special:BannerRandom endpoint. That should be guaranteed to not break anything else. (The patch for fully removing the code is more complicated and needs more careful checking before deployment.)

@chad and @Ottomata thanks so much!

AndyRussG added a comment.Feb 13 2015, 8:12 PM

This comment was removed by AndyRussG.

Change 190506 had a related patch set uploaded (by AndyRussG):
Deprecate Special:BannerRandom and return a 410

https://gerrit.wikimedia.org/r/190506

Patch-For-Review

Change 190506 merged by jenkins-bot:
Deprecate Special:BannerRandom and return a 410

https://gerrit.wikimedia.org/r/190506

AndyRussG mentioned this in rECNO255519b21d2c: Deprecate Special:BannerRandom and return a 410.Feb 13 2015, 8:52 PM

Change 190559 had a related patch set uploaded (by AndyRussG):
Warn and die at unexplained SpecialBannerRandom execution

https://gerrit.wikimedia.org/r/190559

Patch-For-Review

greg assigned this task to AndyRussG.Feb 13 2015, 9:44 PM

Change 190559 merged by jenkins-bot:
Warn in a special log on unexpected code execution

https://gerrit.wikimedia.org/r/190559

awight mentioned this in rECNO8cc51eb130ff: Warn in a special log on unexpected code execution.Feb 13 2015, 11:00 PM

Change 190596 had a related patch set uploaded (by AndyRussG):
Fix detail in warn in a special log on unexpected code execution

https://gerrit.wikimedia.org/r/190596

Patch-For-Review

Change 190596 merged by jenkins-bot:
Fix detail in warn in a special log on unexpected code execution

https://gerrit.wikimedia.org/r/190596

AndyRussG mentioned this in rECNOea51b75ac65f: Fix detail in warn in a special log on unexpected code execution.Feb 13 2015, 11:35 PM

We think the spam to hhvm.log is resolved now, it's been almost 2 hours with no more messages. Sorry, kthx bye!

All gone! Thanks guys!

awight added projects: § Fundraising Sprint Devo, Wikimedia-Fundraising.Feb 14 2015, 1:23 AM

awight moved this task from In Analysis to Pending Code Review on the § Fundraising Sprint Devo board.

awight moved this task from Pending Code Review to Done on the § Fundraising Sprint Devo board.

BTW, we're capturing some extra info about the phantom code execution in fluorine:/a/mw-log/T89258.log and should have some more insight pretty soon. :) Many thanks to everyone, and especially to @awight for his brilliant loggy suggestions and deploy-fu!

Results from this debug log:

All the calls to this code were indeed coming through the expected route (SpecialBannerRandom::execute()).
A sample of the URLs shows:
- a relatively normal distribution of countries, except for a high prevalence of India and "XX" (which means CentralNotice couldn't determine the country), and
- a high percentage of mobile devices (as detected by CentralNotice).

I suppose the above might be interesting if anyone would like to investigate the forever-cached, deprecated JS that is making these calls (mystery no. 1, above).

Regarding mystery no. 2, the unexplained higher prevalence of the error messages in hhvm.log vis-à-vis the number of requests to Special:BannerRandom: it turns out to be explained by mistakes on my part, namely that I thought Varnish made two retries on 503s (it actually does three) and I made a dumb calculation mistake involving disappearing parentheses :(. In other words, no mystery no. 2.

Thanks everyone for your patience with my thickheadedness and slow learning about what to do in this kind of situation. And many thanks to @K4-713, who also made some excellent loggy suggestions.

Unless anyone objects, we'll undeploy the special debug logging at the next opportunity.

countries.tsv808 BDownload

devices.tsv58 BDownload

AndyRussG edited a custom field.Feb 18 2015, 2:58 AM

@AndyRussG interesting that you're seeing a high prevalence of country=XX. We've gotten a smattering of reports about that, and it happens to me consistently, but on first investigation since the fix for it a few months ago we thought maybe it was a caching issue. Here's the bug about it: T87677

Should we reopen?

• demon mentioned this in T89918: OOM reported at SpecialPage.php:534 due to large output from Special:Book.Feb 19 2015, 12:54 AM

Jdforrester-WMF removed a project: Patch-For-Review.Mar 3 2015, 5:27 PM

AndyRussG mentioned this in rMEXTc98ce06e8642: Updated mediawiki/extensions Project: mediawiki/extensions/CentralNotice….Mar 11 2015, 6:26 AM

AndyRussG mentioned this in rMEXTc34baa23df9b: Updated mediawiki/extensions Project: mediawiki/extensions/CentralNotice….

AndyRussG mentioned this in rMEXTed83775498cf: Updated mediawiki/extensions Project: mediawiki/extensions/CentralNotice….

AndyRussG mentioned this in T88276: [1 day] BUG: CN banner being shown after campaign end.Apr 2 2015, 1:58 AM

• demon moved this task from Untriaged to Resolved on the Wikimedia-production-error board.Jun 12 2015, 6:35 PM

• atgo added a project: Fundraising-Backlog.Jun 30 2015, 10:16 PM

• atgo moved this task from Triage to Closed Sprint Work Q3 1415 on the Fundraising-Backlog board.

• atgo unsubscribed.Mar 30 2016, 10:03 PM

• mmodell removed a subscriber: awight.Jun 22 2017, 9:44 PM

Restricted Application added subscribers: Jay8g, TerraCodes. · View Herald TranscriptJun 22 2017, 9:44 PM

• DStrine edited projects, added Fr-tech-archived-from-FY-2014/15; removed Fundraising-Backlog.Jan 11 2018, 9:28 PM

• mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM

Restricted Application added a subscriber: Pcoombe. · View Herald TranscriptAug 28 2019, 11:11 PM

Xaosflux mentioned this in T230991: Special:BannerRandom returns error 410.Apr 24 2021, 10:46 AM

	F42789: devices.tsv
	Feb 17 2015, 11:32 PM

	F42788: countries.tsv
	Feb 17 2015, 11:32 PM

Special:BannerRandom causing lots of fatal error messagesClosed, ResolvedPublic2 Estimated Story PointsPRODUCTION ERRORActions

Description

Details

Related Objects

Event Timeline

Special:BannerRandom causing lots of fatal error messages
Closed, ResolvedPublic2 Estimated Story PointsPRODUCTION ERROR
Actions