MediaWiki waits for memcached connection forever
Closed, Resolved (Public)

Description

Today a number of DB transactions from mw1207 were left hanging open. The offending Apache threads showed this stack trace:

(gdb) zbacktrace
[0xbe6a1190] usleep()
/usr/local/apache/common-local/php-1.23wmf18/includes/objectcache/BagOStuff.php:188
[0xbe6a0ae8] lock()
/usr/local/apache/common-local/php-1.23wmf18/includes/filerepo/file/LocalFile.php:1832
[0xbe6a00b0] lock()
/usr/local/apache/common-local/php-1.23wmf18/includes/filerepo/file/LocalFile.php:1164
[0xbe69fc30] upload()
/usr/local/apache/common-local/php-1.23wmf18/includes/upload/UploadBase.php:692
[0xbe69e790] performUpload()
/usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiUpload.php:649
[0xbe69e3e0] performUpload()
/usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiUpload.php:144
[0xbe69cc98] getContextResult()
/usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiUpload.php:111
[0xbe69c610] execute()
/usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiMain.php:900
[0xbe69c050] executeAction()
/usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiMain.php:364
[0xbe69be48] executeActionWithErrorHandling()
/usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiMain.php:335
[0xbe69aee8] execute()
/usr/local/apache/common-local/php-1.23wmf18/api.php:86
[0xbe69add8] ??? /usr/local/apache/common-local/w/api.php:3

Paravoid noticed that twemproxy on mw1207 had stopped listening on port 11211. Restarting it allowed the open transactions to complete and the Apache threads to continue as normal.

MediaWiki shouldn't wait forever if memcached is not responding.
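
A minimal sketch of the bounded wait being asked for, written in Python rather than MediaWiki's PHP. The names `acquire_lock` and `try_add` and the timeout values are hypothetical and are not the BagOStuff API; the sketch only shows a retry loop that gives up at a deadline instead of sleeping indefinitely.

```python
import time


def acquire_lock(try_add, timeout=6.0, sleep_interval=0.05,
                 clock=time.monotonic, sleep=time.sleep):
    """Retry try_add() until it succeeds or `timeout` seconds elapse.

    try_add is a hypothetical callable that returns True when the lock
    key was created and False otherwise. Returns True if the lock was
    acquired and False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        if try_add():
            return True
        if clock() >= deadline:
            return False  # give up instead of looping forever
        sleep(sleep_interval)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real delays; a caller would normally leave them at their defaults and treat a `False` return as a failed upload rather than holding a DB transaction open.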


Version: 1.23.0
Severity: major

Details

Reference
bz63058
bzimport set Reference to bz63058.

Change 122550 had a related patch set uploaded by Aaron Schulz:
Speed up LocalFile locking behavior

https://gerrit.wikimedia.org/r/122550

Change 122550 merged by jenkins-bot:
Speed up LocalFile locking behavior

https://gerrit.wikimedia.org/r/122550

Change 123041 had a related patch set uploaded by Aaron Schulz:
Made BagOStuff fail fast in cas/lock on certain errors

https://gerrit.wikimedia.org/r/123041

Change 123041 merged by jenkins-bot:
Made BagOStuff fail fast in cas/lock on certain errors

https://gerrit.wikimedia.org/r/123041
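
One way to read "fail fast on certain errors" is sketched below in Python. This is an assumption about the intent of the change, not its actual code: `AddResult`, `lock_fail_fast`, and the three outcomes are hypothetical names. The idea is to retry only when the lock is held by someone else, and to return immediately when the cache backend itself is unreachable.

```python
import time
from enum import Enum


class AddResult(Enum):
    STORED = "stored"            # lock key created, we hold the lock
    EXISTS = "exists"            # another holder has the lock, retrying makes sense
    UNREACHABLE = "unreachable"  # backend did not answer, retrying is pointless


def lock_fail_fast(add, timeout=6.0, sleep_interval=0.05,
                   clock=time.monotonic, sleep=time.sleep):
    """Try to take a lock, retrying only on contention.

    add is a hypothetical callable returning an AddResult. Returns True
    if the lock was acquired, False on backend error or timeout.
    """
    deadline = clock() + timeout
    while True:
        result = add()
        if result is AddResult.STORED:
            return True
        if result is AddResult.UNREACHABLE:
            return False  # fail fast: no sleep, no further attempts
        if clock() >= deadline:
            return False  # contention outlasted the deadline
        sleep(sleep_interval)
```

With a proxy that has stopped listening, as in the report above, such a loop would make a single attempt and return without sleeping.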

Aaron: Both patches are merged. Should we wait to see whether this still happens and whether more work is needed, or can this issue be considered fixed?
