memcache on labsconsole.wikimedia.org craps out pretty often
Closed, Resolved (Public)

Description

Currently when I try to access https://labsconsole.wikimedia.org/wiki/Main_Page, it takes about 15 seconds to load. Something is broken.


Version: unspecified
Severity: normal

Details

Reference
bz42127

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:06 AM
bzimport added a project: Cloud-VPS.
bzimport set Reference to bz42127.

Restarted memcached on virt0. Is that better?

(In reply to comment #1)
> Restarted memcached on virt0. Is that better?

Yes, much. Thank you. :-)

I'm not sure if this bug is now resolved or if there's an underlying issue that needs to be investigated and corrected. I'll leave that determination to you.

The memcached instance on virt0 dies about twice per day, which makes labsconsole practically unusable and also drops every user session.

I would say this belongs to the Infrastructure component. Please ping either Ryan Lane or Andrew Bogott (ops team) to get this issue resolved.

Possible culprit: virt0 running out of memory and memcached being killed by the Linux out-of-memory (OOM) killer.
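One way to test this hypothesis is to scan the kernel log for OOM-killer reports that name memcached. The following is a minimal sketch, not something that was run on virt0: the function name `find_oom_kills` is hypothetical, and the `Killed process <pid> (<name>)` wording is an assumption about the kernel log format, which varies between kernel versions.

```python
import re

# Matches kernel log lines such as:
#   "Killed process 1234 (memcached) total-vm:..."
# The exact wording differs between kernel versions; this pattern is an assumption.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines, process_name="memcached"):
    """Return the PIDs of `process_name` that the OOM killer reported killing."""
    pids = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match and match.group(2) == process_name:
            pids.append(int(match.group(1)))
    return pids
```

Feeding it the lines of the kernel log (for example, the output of `dmesg` split into lines) and getting an empty list back would argue against the OOM-killer theory, provided the log reaches back far enough to cover a crash.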

The crash is caused by a hardware problem on the host (probably bad memory). We have plans to migrate to fresh hardware, but it won't happen immediately.

In theory, Puppet restarts memcached whenever it runs, which should limit the length of each outage. Have y'all experienced more than 30 minutes of this at a time?
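To answer that question with data rather than impressions, a small probe could check whether memcached is answering. This is a sketch under stated assumptions: the helper name `memcached_is_up` is hypothetical, and it assumes memcached listens on its default TCP port 11211 and speaks the text protocol, whose `version` command replies with a line starting with `VERSION`.

```python
import socket


def memcached_is_up(host, port=11211, timeout=2.0):
    """Return True if a server at host:port answers the memcached `version` command."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"version\r\n")
            reply = conn.recv(128)
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
    # A healthy server replies with something like b"VERSION 1.4.13\r\n".
    return reply.startswith(b"VERSION ")
```

Run from cron every minute with the result logged, this would show whether outages really stay under one Puppet run interval.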

Thanks for the confirmation, Andrew. Can't we bring the machine down and run a memory test? That should isolate the faulty memory.

As for memcached, I usually get someone from ops to restart it, so downtime is pretty short for me :-]

I could not access the Nagios history for the "Virt0 > memcached" service, so it is hard to know how long it stays down.
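Without the Nagios history, outage lengths could still be estimated from any periodic up/down record. The sketch below assumes such a record exists as time-ordered `(timestamp_seconds, is_up)` pairs; both that format and the function name `outage_durations` are hypothetical.

```python
def outage_durations(samples):
    """Return the length in seconds of each completed outage.

    `samples` is a time-ordered iterable of (timestamp_seconds, is_up) pairs.
    An outage runs from the first failed check to the next successful one,
    so the result is only as precise as the polling interval. An outage
    still in progress at the end of the data is not counted.
    """
    durations = []
    down_since = None
    for timestamp, is_up in samples:
        if not is_up and down_since is None:
            down_since = timestamp  # outage starts
        elif is_up and down_since is not None:
            durations.append(timestamp - down_since)  # outage ends
            down_since = None
    return durations


samples = [(0, True), (60, False), (120, False), (180, True), (240, False), (300, True)]
print(outage_durations(samples))  # [120, 60]
```

Comparing the largest value against the 30-minute figure mentioned above would show whether the Puppet restart is actually bounding the downtime.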

*** Bug 44499 has been marked as a duplicate of this bug. ***

I've downgraded memcached on virt0. I've noticed the same behavior on nova-precise2, so it's very likely not a memory issue, but some memcached bug. If we still experience crashes, then I'll install a version with debugging symbols so that I can get a proper backtrace.

I got the error from 44499 when loading a set of five tabs on the same wiki after a browser restart.

This is not necessarily memcache, since this box is running on WinCache.

Just not waiting long enough? The server may have had to spin up as well.


Could not acquire 'ImpulseWiki:messages:en:status' lock.

Backtrace:

#0 D:\MediaWiki\core\includes\cache\MessageCache.php(710): MessageCache->load('en')
#1 D:\MediaWiki\core\includes\cache\MessageCache.php(650): MessageCache->getMsgFromNamespace('Pagetitle', 'en')
#2 D:\MediaWiki\core\includes\Message.php(720): MessageCache->get('pagetitle', true, Object(Language))
#3 D:\MediaWiki\core\includes\Message.php(464): Message->fetchMessage()
#4 D:\MediaWiki\core\includes\Message.php(553): Message->toString()
#5 D:\MediaWiki\core\includes\OutputPage.php(835): Message->text()
#6 D:\MediaWiki\core\includes\OutputPage.php(878): OutputPage->setHTMLTitle(Object(Message))
#7 D:\MediaWiki\core\includes\Article.php(554): OutputPage->setPageTitle('Powershell suck...')
#8 D:\MediaWiki\core\includes\actions\ViewAction.php(44): Article->view()
#9 D:\MediaWiki\core\includes\Wiki.php(439): ViewAction->show()
#10 D:\MediaWiki\core\includes\Wiki.php(305): MediaWiki->performAction(Object(Article), Object(Title))
#11 D:\MediaWiki\core\includes\Wiki.php(565): MediaWiki->performRequest()
#12 D:\MediaWiki\core\includes\Wiki.php(458): MediaWiki->main()
#13 D:\MediaWiki\core\index.php(59): MediaWiki->run()
#14 {main}

This was due to a newer version of memcached and the way it handles memory exhaustion. I'm not sure why this bug is still open; it was fixed ages ago.