The "Wait" extension from the case study
A single wiki page that takes a long time to render -- due to a
long-running tag extension -- blocks ALL other users of the wiki from
viewing ANY pages.
This bug might sound unlikely -- given that Wikipedia has scaled so
massively -- but I've reproduced it now on four
different MediaWiki servers, two running Windows 2003 Server, and two
running Linux (CentOS, Ubuntu), all with different PHP and Apache
versions & configurations. One of them was a plain vanilla MediaWiki
install created just to test this bug.
The problem doesn't seem to be related to sessions (e.g., when session
data is locked to prevent concurrent writes, as described at
http://us2.php.net/manual/en/function.session-write-close.php). Even
with cookies disabled, the same behavior happens.
CASE STUDY TO REPRODUCE THE PROBLEM
These steps will create two web pages, one slow and one fast, and
demonstrate that ALL other users cannot use the wiki at all while
the slow page is rendering.
- Start with a plain vanilla install of MediaWiki 1.13.
- Install the "Wait" extension attached to this bug ticket,
that simply sleeps for N seconds. The source code is:
<?php
Wait for N seconds
$wgExtensionFunctions[] = 'wfWaitSetup';
function wfWaitSetup() {
global $wgParser; $wgParser->setHook('wait', 'wfWait');
}
function wfWait($input) {
global $wgParser; $wgParser->disableCache(); sleep($input); return "Slept for $input seconds";
}
- Create two wiki articles: one called "Slow" that contains:
<wait>30</wait>
and one called "Fast" that contains the text "Hello world". (Note
that saving the "Slow" page will take 30 seconds as the <wait> tag is
rendered.)
Now you are ready to demonstrate the problem.
- Open web browsers on TWO DIFFERENT PCs. (Alternatively, open two
different browsers -- say, Firefox and IE -- on the same PC.) Do not
just open 2 windows or tabs of the same browser instance, because
they might share session information, which would invalidate this test.
- In the first browser, hit the "Slow" page.
- While the Slow page is sleeping, in the second browser, hit the
"Fast" page. Ensure that you are NOT getting a cached copy of Fast,
say, by force-refreshing the browser (shift-refresh in Firefox,
ctrl-F5 in IE, etc.) or appending action=purge. (Or hit
Special:Version or other dynamic page instead of the "Fast" page.)
- While the Slow page is sleeping, the Fast page WILL NOT RENDER. It
will be delayed until the Slow page has finished sleeping. This is
the problematic behavior: the Slow page -- in an entirely different
session -- is blocking the Fast page.
SOME ANALYSIS
Long-running parser tags like <wait> are realistic. Imagine a parser
tag that hits an external database to run SQL queries (that's how
we found this problem), or a tag that requires an external resource
that is slow or unavailable.
This does not seem to be an Apache problem. While the "Slow" page is
sleeping, you can hit another non-MediaWiki PHP page on the same
webserver (say, test.php that runs "phpinfo()" and exits) and there
are no delays.
This also does not seem to be a PHP session problem. On the same
webserver, if you create two simple PHP scripts that open a session,
and hit them with separate browsers as above, the problem does NOT
occur. Here are example scripts:
<?php
Script number 1
session_start();
sleep(20);
echo "Done";
<?php
Script number 2
session_start();
echo "Done";
If you hit script #1 and, while it's sleeping, you hit script #2,
using independent browsers in different sessions, the problem does not
occur. (Which is good & expected.)
Why doesn't Wikipedia experience this problem? Probably because of its
extreme caching architecture (with squid, etc.), and that it doesn't
have any long-running extensions installed. Or maybe it has a magical
PHP/Apache/MySQL configuration that prevents the problem?
STACK DUMP
I did the case-study above and let the "Fast" page time out and fail
while "Slow" was sleeping. Here is the stack dump from that moment,
produced by xdebug:
Fatal error: Maximum execution time of 30 seconds exceeded in includes/db/Database.php
on line 579
Stack trace:
- {main}() index.php:0
- MediaWiki->checkInitialQueries() index.php:60
- Title::newMainPage() includes/Wiki.php:105
- wfMsgForContent() includes/Title.php:293
- wfMsgReal() includes/GlobalFunctions.php:380
- wfMsgGetKey() includes/GlobalFunctions.php:432
- StubObject->get() includes/GlobalFunctions.php:467
- StubObject->__call() includes/StubObject.php:0
- StubObject->_call() includes/StubObject.php:76
- call_user_func_array() includes/StubObject.php:58
- MessageCache->get() includes/StubObject.php:0
- MessageCache->getMsgFromNamespace() includes/MessageCache.php:543
- MessageCache->load() includes/MessageCache.php:606
- MessageCache->saveToCaches() includes/MessageCache.php:250
- BagOStuff->add() includes/MessageCache.php:419
- SqlBagOStuff->set() includes/BagOStuff.php:100
- SqlBagOStuff->delete() includes/BagOStuff.php:274
- SqlBagOStuff->_query() includes/BagOStuff.php:288
- MediaWikiBagOStuff->_doquery() includes/BagOStuff.php:319
- Database->query() includes/BagOStuff.php:433
- Database->doQuery() includes/db/Database.php:540
And the offending line 579 is:
$ret = mysql_query( $sql, $this->mConn );
BRION VIBBER'S TEST
Brion did a quick test with this on his local wiki and thought the
problem might be session-related. He writes in mediawiki-l:
"If I preview a page with <wait>20</wait>, then go load something up
in another tab in the same browser, it sits there waiting on both
tabs. (Confirmed with Firefox 3 and Safari 3 on Mac OS X.) If on the
other hand I go load things up in another browser, there's no delay
there. If I disable cookies (thus removing session affinity), then a
second tab in the same browser has no slowdown. And indeed, it appears
that PHP session files are by default locked to prevent multiple
simultaneous accesses."
I don't know why Brion had different results than I did. Perhaps his
"Fast" page was cached so there was no slowdown due to the Slow page.
I've asked him to try again and hit Special:SpecialPages instead of
"Fast". Personally, I would LOVE to hear that my wikis are
misconfigured in some way, or that I've made some mistake.
This problem is killing performance for us.
Version: 1.13.x
Severity: critical
Attached: