Commons Main Page availability issue
Closed, Resolved (Public)

Description

Author: afeldman

Description:
The commons main page has been periodically unavailable this morning due to the poolqueue for this page filling up.

  • http://commons.wikimedia.org/wiki/Main_Page isn't parser cacheable. With debug logging enabled, "Parser output marked as uncacheable" is logged, which comes from Parser::disableCache. That seems to only be called from one place which requires ( $title->getNamespace() == NS_SPECIAL && $this->mOptions->getAllowSpecialInclusion() && $this->ot['html'] ) to be true. I don't see anything Main_Page related in the special namespace, so not sure what's going on there? It also results in "don't cache" headers for squid.
  • The poolcounter makes a lot of sense for hot / rapidly changing pages that can be parser cached. One apache gets the lock, and all others queue up behind it or, once 50 are queued, get an immediate error. For a popular page that can't be parser cached, it really sucks. All requests are serialized and stack up, resulting in very slow page load times, or immediate errors.
  • Pages like this are insanely easy to DOS - either deliberately with minimal effort or just due to natural traffic spikes.
  • Main Pages should probably all be parser cacheable and/or we should disable use of the poolqueue on pages that aren't. It currently seems like this isn't determined until after parsing however.
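
The queueing behaviour described above can be illustrated with a small simulation. This is a toy model in Python, not PoolCounter's actual implementation: the function name simulate, the tick-based timing, and the traffic numbers in the usage note are assumptions made for illustration. Only the limit of 50 queued requests comes from the description.

```python
def simulate(arrivals_per_tick, ticks, parse_ticks, cacheable, max_queue=50):
    """Return (served, rejected) for one hot page behind a pool counter.

    Hypothetical model: one request parses at a time, up to max_queue
    requests wait behind it, and anything beyond that is rejected at once.
    If the page is cacheable, the first finished parse fills the cache and
    every waiting or later request is served from it.
    """
    queue = 0        # requests waiting for the lock (not counting the parser)
    busy_for = 0     # ticks left on the parse currently holding the lock
    cached = False
    served = 0
    rejected = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if cached:
                served += 1               # cache hit, no lock needed
            elif busy_for == 0:
                busy_for = parse_ticks    # this request takes the lock
            elif queue < max_queue:
                queue += 1                # wait behind the current parse
            else:
                rejected += 1             # queue full: immediate error
        if busy_for > 0:
            busy_for -= 1
            if busy_for == 0:
                served += 1               # the parsing request completes
                if cacheable:
                    cached = True
                    served += queue       # waiters read the fresh cache entry
                    queue = 0
                elif queue > 0:
                    queue -= 1            # next waiter must parse all over again
                    busy_for = parse_ticks
    return served, rejected
```

Under these assumed numbers (10 requests per tick, 5 ticks per parse, 100 ticks), simulate(10, 100, 5, cacheable=True) serves all 1000 requests, while cacheable=False serves only 20 and rejects 930.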

Version: unspecified
Severity: critical

bzimport added a subscriber: Unknown Object (MLST).
bzimport set Reference to bz30428.
bzimport created this task. Via Legacy, Aug 17 2011, 8:17 PM
Bawolff added a comment. Via Conduit, Aug 17 2011, 8:25 PM

Appears to be caused by including <categorytree>. Not sure why that's triggering the special page transclusion cache-killing stuff.

I think we should still cache pages with transcluded special pages, just maybe for a limited time (like {{CURRENTDAY}} does) since they no longer vary with url params in trunk, but that's a side issue.

Bawolff added a comment. Via Conduit, Aug 17 2011, 9:00 PM

Note, the special page transclusion thing is unrelated. Several extensions call Parser::disableCache, including CategoryTree.

What's really surprising is that Wikimedia doesn't have $wgCategoryTreeDisableCache = false; set. That setting really should be set to false for larger sites.

Reedy added a comment. Via Conduit, Aug 17 2011, 10:09 PM

PoolCounter has been re-enabled

$wgCategoryTreeDisableCache has been set to false too

bzimport added a comment. Via Conduit, Aug 17 2011, 10:30 PM

afeldman wrote:

$wgCategoryTreeDisableCache = false did the trick, the commons home page is now getting parser cached as well as cached by squid.

This issue appears to have arisen due to a DoS attack generating ~5k reqs/sec that got lucky and happened upon this weak spot in our infrastructure. Action has also been taken to block that traffic.

We should check extensions used in production for Parser::disableCache calls as this general issue could hit us again elsewhere.

Bawolff added a comment. Via Conduit, Aug 17 2011, 10:40 PM

> We should check extensions used in production for Parser::disableCache calls as
> this general issue could hit us again elsewhere.

Grepping says the following extensions can disable the cache in some circumstances (going through the ones that are in /branches/wmf/1.17wmf1/extensions):

*DonationInterface
*Quiz
*CommunityVoice
*ScanSet

Quiz is probably the only one that is really widely used. I'm unsure whether anything still uses CommunityVoice or ScanSet, and DonationInterface is probably a legitimate exception. In core, transcluding a special page such as {{special:recentchanges}} also disables the cache, which probably isn't really necessary (especially for things like {{special:prefixindex/foo}}).
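
A check like the grep above could be made repeatable with a short script. The following is a minimal sketch in Python, assuming a checkout laid out as one directory per extension. The function name find_disable_cache_calls and the regular expression are illustrative assumptions, and a plain text match like this will also flag commented-out calls.

```python
import re
from pathlib import Path

# Matches method calls such as $parser->disableCache() (assumed call shape).
DISABLE_CACHE_CALL = re.compile(r"->\s*disableCache\s*\(")


def find_disable_cache_calls(extensions_dir):
    """Map each top-level extension directory to its (file, line number) hits."""
    root = Path(extensions_dir)
    hits = {}
    for php_file in sorted(root.rglob("*.php")):
        relative = php_file.relative_to(root)
        extension = relative.parts[0]  # first path component names the extension
        text = php_file.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if DISABLE_CACHE_CALL.search(line):
                hits.setdefault(extension, []).append((relative.as_posix(), number))
    return hits
```

Calling find_disable_cache_calls on the extensions directory returns a dictionary keyed by extension name, so extensions with no hits simply do not appear.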

RobLa-WMF added a comment. Via Conduit, Aug 18 2011, 4:37 PM

Re-enabling the cache seems to not only have solved the problem, but (not too surprisingly) brought page load times down pretty substantially:
http://status.wikimedia.org/8777/163404/Wiki-commons-%28s4%29

The downside is that I imagine we're going to start getting complaints about the CategoryTree being out of date. It seems as though completely disabling the cache is *very* rarely the right answer in production, and that setting a very short time-to-live on the parser cache (e.g. 5 minutes) will be good enough in 99% of cases. Is setting parser cache TTL something that is as easy for extension authors to do as disabling the cache entirely?
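
To make the short-TTL idea concrete, here is a minimal sketch in Python of a cache where a page component can lower the expiry instead of switching caching off. It does not reflect MediaWiki's actual parser cache API: the class names ParsedPage and ParserCacheSketch, the method update_cache_expiry, and the default TTL of one day are all hypothetical.

```python
class ParsedPage:
    """Parse result carrying its own time-to-live in seconds (hypothetical)."""

    def __init__(self, html, default_ttl=86400):
        self.html = html
        self.ttl = default_ttl

    def update_cache_expiry(self, seconds):
        # Only ever lower the TTL, so the most volatile component wins.
        self.ttl = min(self.ttl, seconds)


class ParserCacheSketch:
    """Stores rendered HTML per title until its expiry time passes."""

    def __init__(self, clock):
        self._clock = clock    # callable returning the current time in seconds
        self._entries = {}     # title -> (html, expires_at)

    def get(self, title, parse):
        """Return (html, was_cache_hit); call parse() only on a miss."""
        now = self._clock()
        entry = self._entries.get(title)
        if entry is not None and now < entry[1]:
            return entry[0], True
        page = parse()
        if page.ttl > 0:       # a TTL of 0 models disabling the cache entirely
            self._entries[title] = (page.html, now + page.ttl)
        return page.html, False
```

In this sketch an extension that wants reasonably fresh output would call page.update_cache_expiry(300) during parsing. The page is then re-parsed at most once every five minutes rather than on every request.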

MarkAHershberger added a comment. Via Conduit, Aug 18 2011, 6:25 PM

Discussion of Comment #6 branched to Bug #30448 since the issue in this bug is resolved.
