Page MenuHomePhabricator

Special:BannerRandom causing lots of fatal error messages
Closed, ResolvedPublic2 Estimated Story PointsPRODUCTION ERROR

Description

Spotted in production:

   478 Undefined property: SpecialBannerRandom::$allocContext in /srv/mediawiki/php-1.25wmf16/extensions/CentralNotice/special/SpecialBannerRandom.php on line 27
   478 error: Argument 1 passed to BannerChooser::__construct() must be an instance of AllocationContext, null given in /srv/mediawiki/php-1.25wmf16/extensions/CentralNotice/includes/BannerChooser.php on
line 41

Event Timeline

demon raised the priority of this task from to Needs Triage.
demon updated the task description. (Show Details)
demon subscribed.
demon triaged this task as Unbreak Now! priority.Feb 12 2015, 4:07 PM

This is the single noisiest item in the hhvm log right now. Please fix as it drowns out all other information.

awight renamed this task from Bad allocation context for BannerChooser to Special:BannerRandom causing lots of fatal error messages.Feb 12 2015, 9:12 PM
awight set Security to None.

In both varnish-ncsa and varnish-kafka based logs, we only see ~30K/day requests to Special:BannerRandom.

Reading the text cluster's varnish setup, it looks like there should be 1 retry per 503. Hence each of the above ~30K/day requests should account for 4 lines in hhvm.log (2xNotice, 2xFatal).

For all the 20-ish varnish-kafka based log lines that I checked, I could find those expected 4 lines in hhvm.log.

However, hhvm.log also has SpecialBannerRandom Notices/BannerChooser Failures that seem to lack a corresponding varnish-kafka log line.

For example on ~2015-02-12T07:13:06 only one Special:BannerRandom request is in the the varnish-kafka logs (with a few seconds buffer before/after without such a request). But the hhvm.log has 32 corresponding lines. That's much more than the 4 expected lines.

We store the varnish-kafka logs unsampled, so it cannot be sampling.
And the varnish-kafka logs seem flawless for the above timestamp.

  • Are there under some circumstances more than 1 retry for 503s on text? (That might help to explain the mismatch in volume between the number of varnish-kafka log lines and the number of lines in hhvm.log)
  • Can it be that some services try to fetch Special:BannerRandom directly from the app-servers (or whatever is serving the Special:BannerRandom requests)? (Thereby by-passing the caches and the logging through varnish-kafka)

(Lacking access to the app-servers, I cannot respond to those two questions)

Thanks a ton, @QChris!! A few more notes:

  • The code for Special:BannerRandom was thought to be essentially dead and is fully deprecated on WMF production. This happened in November, 2014.
  • An update to CentralNotice was deployed on Tuesday. This is when the error started happening. We understand why that code doesn't work, but we don't know why it's being called. In theory, it should only get called when there' s a request to Special:BannerRandom.
  • Request logs don't show an increase in incoming requests to Special:BannerRandom at the time of the recent CentralNotice update.
  • Today, some supposedly unused remnants of configurations related to Special:BannerRandom were removed from mediawiki-config, apparently with no effect.
  • Since it's been such a long time since calls to Special:BannerRandom were in theory turned off, it seems strange to still be getting all these calls to it (the ~30K/day that we know are coming in). This strangeness is in addition to the strangeness of all the errors in the hhvm log, which seem to be a lot more than the ~30k/day can account for (see above).
  • There's no clear pattern in the devices sending the Special:BannerRandom requests (other than maybe a high number of mobile devices).
  • We have only one stack trace for these hhvm log entries, which does show a request to Special:BannerRandom as the cause. The URL has the same old params that used to get sent in, so it does look like it could come from some old cached Javascript that just won't go away.
  • The referrers seem to be normal wiki pages, which would also be expected for clients that are somehow still running with the old JS that did use Special:BannerRandom.
  • @awight made a very nice patch to fully strip out Special:BannerRandom from CentralNotice. This is feasible, but I'm a wee bit scared. What if whatever is calling this broken code then breaks worse and breaks other stuff? Or what if this is a symptom of some unexpected behavior that needs to be uncovered, and removing the code for Special:BannerRandom just masks it? I confess I'd feel happier and cozier getting to the bottom of this before deploying such a change.

Thanks much!! :)

There are two mysteries, right?

  1. why is this page being called?
  2. why are there significantly more hhvm error logs than there are webrequest logs for this page?

Mystery 1 could be explained away by cached client JS, but maybe not too. Mystery 2 is a little more interesting, but, since this is dead code, my opinion is that it isn’t worth the (possibly large) effort in solving it.

If someone else disagrees, I think Andy needs more stacktraces from the error logs to look at the code path that is calling this page. Maybe some of them are directly hitting this as a request, but maybe some others are coming from some other less direct code path. I'm not sure how to get him more stacktraces, so if someone wants to help, that's the first thing to do.

If no one really cares so much, then let's just have them deploy the change to remove the code, eh?

There are two mysteries, right?

  1. why is this page being called?
  2. why are there significantly more hhvm error logs than there are webrequest logs for this page?

Mystery 1 could be explained away by cached client JS, but maybe not too. Mystery 2 is a little more interesting, but, since this is dead code, my opinion is that it isn’t worth the (possibly large) effort in solving it.

Agreed. If it's already code on its way out, let's just finish making it go away. I really don't think it's that interesting of a mystery to keep looking into.

If someone else disagrees, I think Andy needs more stacktraces from the error logs to look at the code path that is calling this page. Maybe some of them are directly hitting this as a request, but maybe some others are coming from some other less direct code path. I'm not sure how to get him more stacktraces, so if someone wants to help, that's the first thing to do.

I don't know what more we need. P289 yesterday showed us it was coming from the expected code path. The description here shows what breaks when that does ($allocContext isn't set in BannerRandom since the refactor, ends up being passed null to BannerChooser). The fix is clearly:

  • Remove dead code, or
  • Fix SpecialBannerRandom to properly set $allocContext before passing it.

So yes, maybe as an academic exercise it would be nice to understand why it's being called, but I don't think it's worth leaving broken in prod until we do.

If no one really cares so much, then let's just have them deploy the change to remove the code, eh?

Yes, please.

Quick update here: we're preparing a simple patch to just return a 410 from the Special:BannerRandom endpoint. That should be guaranteed to not break anything else. (The patch for fully removing the code is more complicated and needs more careful checking before deployment.)

@chad and @Ottomata thanks so much!

This comment was removed by AndyRussG.
gerritbot subscribed.

Change 190506 had a related patch set uploaded (by AndyRussG):
Deprecate Special:BannerRandom and return a 410

https://gerrit.wikimedia.org/r/190506

Patch-For-Review

Change 190506 merged by jenkins-bot:
Deprecate Special:BannerRandom and return a 410

https://gerrit.wikimedia.org/r/190506

Change 190559 had a related patch set uploaded (by AndyRussG):
Warn and die at unexplained SpecialBannerRandom execution

https://gerrit.wikimedia.org/r/190559

Patch-For-Review

Change 190559 merged by jenkins-bot:
Warn in a special log on unexpected code execution

https://gerrit.wikimedia.org/r/190559

Change 190596 had a related patch set uploaded (by AndyRussG):
Fix detail in warn in a special log on unexpected code execution

https://gerrit.wikimedia.org/r/190596

Patch-For-Review

Change 190596 merged by jenkins-bot:
Fix detail in warn in a special log on unexpected code execution

https://gerrit.wikimedia.org/r/190596

We think the spam to hhvm.log is resolved now, it's been almost 2 hours with no more messages. Sorry, kthx bye!

BTW, we're capturing some extra info about the phantom code execution in fluorine:/a/mw-log/T89258.log and should have some more insight pretty soon. :) Many thanks to everyone, and especially to @awight for his brilliant loggy suggestions and deploy-fu!

Results from this debug log:

  • All the calls to this code were indeed coming through the expected route (SpecialBannerRandom::execute()).
  • A sample of the URLs shows:
    • a relatively normal distribution of countries, except for a high prevalence of India and "XX" (which means CentralNotice couldn't determine the country), and
    • a high percentage of mobile devices (as detected by CentralNotice).

I suppose the above might be interesting if anyone would like to investigate the forever-cached, deprecated JS that is making these calls (mystery no. 1, above).

Regarding mystery no. 2, the unexplained higher prevalence of the error messages in hhvm.log vis-à-vis the number of requests to Special:BannerRandom: it turns out to be explained by mistakes on my part, namely that I thought Varnish made two retries on 503s (it actually does three) and I made a dumb calculation mistake involving disappearing parentheses :(. In other words, no mystery no. 2.

Thanks everyone for your patience with my thickheadedness and slow learning about what to do in this kind of situation. And many thanks to @K4-713, who also made some excellent loggy suggestions.

Unless anyone objects, we'll undeploy the special debug logging at the next opportunity.


@AndyRussG interesting that you're seeing a high prevalence of country=XX. We've gotten a smattering of reports about that, and it happens to me consistently, but on first investigation since the fix for it a few months ago we thought maybe it was a caching issue. Here's the bug about it: T87677

Should we reopen?

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM
Restricted Application added a subscriber: Pcoombe. · View Herald TranscriptAug 28 2019, 11:11 PM