Page MenuHomePhabricator

Poor concurrent performance due to objectcache table locks
Closed, ResolvedPublic

Description

The "Wait" extension from the case study

A single wiki page that takes a long time to render -- due to a
long-running tag extension -- blocks ALL other users of the wiki from
viewing ANY pages.

This bug might sound unlikely -- given that Wikipedia has scaled so
massively -- but I've reproduced it now on four
different MediaWiki servers, two running Windows 2003 Server, and two
running Linux (CentOS, Ubuntu), all with different PHP and Apache
versions & configurations. One of them was a plain vanilla MediaWiki
install created just to test this bug.

The problem doesn't seem to be related to sessions (e.g., when session
data is locked to prevent concurrent writes, as described at
http://us2.php.net/manual/en/function.session-write-close.php). Even
with cookies disabled, the same behavior happens.

CASE STUDY TO REPRODUCE THE PROBLEM

These steps will create two web pages, one slow and one fast, and
demonstrate that ALL other users cannot use the wiki at all while
the slow page is rendering.

  1. Start with a plain vanilla install of MediaWiki 1.13.
  1. Install the "Wait" extension attached to this bug ticket,

that simply sleeps for N seconds. The source code is:

<?php

Wait for N seconds

$wgExtensionFunctions[] = 'wfWaitSetup';
function wfWaitSetup() {

global $wgParser;
$wgParser->setHook('wait', 'wfWait');

}
function wfWait($input) {

global $wgParser;
$wgParser->disableCache();
sleep($input);
return "Slept for $input seconds";

}

  1. Create two wiki articles: one called "Slow" that contains:

    <wait>30</wait>

and one called "Fast" that contains the text "Hello world". (Note
that saving the "Slow" page will take 30 seconds as the <wait> tag is
rendered.)

Now you are ready to demonstrate the problem.

  1. Open web browsers on TWO DIFFERENT PCs. (Alternatively, open two

different browsers -- say, Firefox and IE -- on the same PC.) Do not
just open 2 windows or tabs of the same browser instance, because
they might share session information, which would invalidate this test.

  1. In the first browser, hit the "Slow" page.
  1. While the Slow page is sleeping, in the second browser, hit the

"Fast" page. Ensure that you are NOT getting a cached copy of Fast,
say, by force-refreshing the browser (shift-refresh in Firefox,
ctrl-F5 in IE, etc.) or appending action=purge. (Or hit
Special:Version or other dynamic page instead of the "Fast" page.)

  1. While the Slow page is sleeping, the Fast page WILL NOT RENDER. It

will be delayed until the Slow page has finished sleeping. This is
the problematic behavior: the Slow page -- in an entirely different
session -- is blocking the Fast page.

SOME ANALYSIS

Long-running parser tags like <wait> are realistic. Imagine a parser
tag that hits an external database to run SQL queries (that's how
we found this problem), or a tag that requires an external resource
that is slow or unavailable.

This does not seem to be an Apache problem. While the "Slow" page is
sleeping, you can hit another non-MediaWiki PHP page on the same
webserver (say, test.php that runs "phpinfo()" and exits) and there
are no delays.

This also does not seem to be a PHP session problem. On the same
webserver, if you create two simple PHP scripts that open a session,
and hit them with separate browsers as above, the problem does NOT
occur. Here are example scripts:

<?php

Script number 1

session_start();
sleep(20);
echo "Done";

<?php

Script number 2

session_start();
echo "Done";

If you hit script #1 and, while it's sleeping, you hit script #2,
using independent browsers in different sessions, the problem does not
occur. (Which is good & expected.)

Why doesn't Wikipedia experience this problem? Probably because of its
extreme caching architecture (with squid, etc.), and that it doesn't
have any long-running extensions installed. Or maybe it has a magical
PHP/Apache/MySQL configuration that prevents the problem?

STACK DUMP

I did the case-study above and let the "Fast" page time out and fail
while "Slow" was sleeping. Here is the stack dump from that moment,
produced by xdebug:

Fatal error: Maximum execution time of 30 seconds exceeded in includes/db/Database.php
on line 579
Stack trace:

  1. {main}() index.php:0
  2. MediaWiki->checkInitialQueries() index.php:60
  3. Title::newMainPage() includes/Wiki.php:105
  4. wfMsgForContent() includes/Title.php:293
  5. wfMsgReal() includes/GlobalFunctions.php:380
  6. wfMsgGetKey() includes/GlobalFunctions.php:432
  7. StubObject->get() includes/GlobalFunctions.php:467
  8. StubObject->__call() includes/StubObject.php:0
  9. StubObject->_call() includes/StubObject.php:76
  10. call_user_func_array() includes/StubObject.php:58
  11. MessageCache->get() includes/StubObject.php:0
  12. MessageCache->getMsgFromNamespace() includes/MessageCache.php:543
  13. MessageCache->load() includes/MessageCache.php:606
  14. MessageCache->saveToCaches() includes/MessageCache.php:250
  15. BagOStuff->add() includes/MessageCache.php:419
  16. SqlBagOStuff->set() includes/BagOStuff.php:100
  17. SqlBagOStuff->delete() includes/BagOStuff.php:274
  18. SqlBagOStuff->_query() includes/BagOStuff.php:288
  19. MediaWikiBagOStuff->_doquery() includes/BagOStuff.php:319
  20. Database->query() includes/BagOStuff.php:433
  21. Database->doQuery() includes/db/Database.php:540

And the offending line 579 is:

$ret = mysql_query( $sql, $this->mConn );

BRION VIBBER'S TEST

Brion did a quick test with this on his local wiki and thought the
problem might be session-related. He writes in mediawiki-l:

"If I preview a page with <wait>20</wait>, then go load something up
in another tab in the same browser, it sits there waiting on both
tabs. (Confirmed with Firefox 3 and Safari 3 on Mac OS X.) If on the
other hand I go load things up in another browser, there's no delay
there. If I disable cookies (thus removing session affinity), then a
second tab in the same browser has no slowdown. And indeed, it appears
that PHP session files are by default locked to prevent multiple
simultaneous accesses."

I don't know why Brion had different results than I did. Perhaps his
"Fast" page was cached so there was no slowdown due to the Slow page.
I've asked him to try again and hit Special:SpecialPages instead of
"Fast". Personally, I would LOVE to hear that my wikis are
misconfigured in some way, or that I've made some mistake.
This problem is killing performance for us.


Version: 1.13.x
Severity: critical

Attached:

Details

Reference
bz15460

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:19 PM
bzimport set Reference to bz15460.

I definitely do *not* get the described effect when using separate browsers.

Special:Specialpages, recentchanges, a few random pages, etc all load fine in my second browser (IE in a Windows VM) while my "slow" page is previewing, saving, or rendering in Firefox on my mac.

Looks like a browser issue. In firefox you can probably tweak your
network parameters (about:config) such as :

network.http.max-connections
network.http.max-connections-per-server
network.http.max-persistent-connections-per-server

I don't see how it can be a browser issue. In the case study, we have two separate users on two separate PCs, and each is simply hitting ONE single (different) wiki page.

For the record, my settings are:

network.http.max-connections = 24
network.http.max-connections-per-server = 8
network.http.max-persistent-connections-per-server = 2

Ok, I *do* see this effect on a fresh 1.13 install. I'll see if there's some configuration funniness going on...

Thank you thank you thank you for looking at this. Can I provide more information that would be helpful? php.ini? httpd.conf? LocalSettings.conf (though it's plain vanilla)? mysql config?

Ok here's the skinny:

1.12 branch fresh install, all default options -- no hang.

1.13 branch fresh install, all default options -- hang.

Trunk (1.14 dev) fresh install, all default options -- no hang.

These are all tested on Mac OS X 10.5, MySQL 5.0.51a, default system Apache/PHP (5.2.6).

So there may be some kind of regression in 1.13, which is interestingly fixed (or didn't happen) in dev trunk.

Poked around a little looking for obvious changes in Article and Title DB usage, didn't see anything clear. Assigning to Tim to investigate this as a regression in release branch.

I installed the 1.14 dev trunk, and the problem disappears for me too. (And I
was pretty sure we hadn't experienced it in 1.12.)

Any chance that a fix could make it into the 1.13.1 patch release? I am happy to help in any way I can.

(In reply to comment #8)

I installed the 1.14 dev trunk, and the problem disappears for me too. (And I
was pretty sure we hadn't experienced it in 1.12.)

Any chance that a fix could make it into the 1.13.1 patch release? I am happy
to help in any way I can.

If we want to track down what caused this, someone who has time for it (which I don't, and I guess it's safe to say Brion doesn't either) should do a binary search-like revision witchhunt (see http://en.wikipedia.org/wiki/Binary_search#Number_Guessing_Game for a quick explanation) to see which revision introduced this bug.

That's why I *assigned* it to Tim. But if you want to do it before he wakes up, be my guest. :)

(In reply to comment #10)

That's why I *assigned* it to Tim. But if you want to do it before he wakes up,
be my guest. :)

Yeah, I only noticed that later.

Found it! (Via the binary search technique suggested.)

The problem was fixed in 1.14alpha by Tim in revision 40209:

$ svn log -r 40209 -v
r40209 | tstarling | 2008-08-29 04:35:00 -0400 (Fri, 29 Aug 2008) | 3 lines
Changed paths:

M /trunk/phase3/RELEASE-NOTES
M /trunk/phase3/includes/BagOStuff.php
  • Fixed intermittent deadlock errors involving objectcache table queries. Use a

separate database connection for the objectcache table to avoid long-lasting locks on that table.

I've confirmed that the bug is present in 40208 and fixed in 40209.

Is the delta between 40208 and 40209 safe to patch into 1.13?

On my own wiki, I mean. (I don't have commit access to your svn repository.)

I applied the delta to mw13 and it seems to fix the issue.

Back ported in 1.13. Thanks for bisecting this issue.


r40391 | hashar | 2008-09-03 22:15:33 +0200 (mer, 03 sep 2008) | 4 lines

Fix #15460 : One slow extension can block all wiki pages for all users

Backport of trunk@40209 by Tim Starling. Pointed by Dan Barrett.

Your patch works for me. Thanks so much Ashar and Brion!!!