
MessageCache.php: Could not acquire lock
Closed, Resolved (Public)

Description

Exports of the translatewiki.net product iHRIS ("repoexport ihris") always exit with exceptions. I've run export.php commands for each individual group of the product, and encountered the issue below at least 10 times for 100 groups or so.

This issue started about a month ago. It is effectively blocking exports of the product, as running them manually in a fully supervised way isn't really an option.

Stack dump details:

2012-12-30 11:00:32 v22011026445052.yourvserver.net mediawiki-bw_: [3dfb4df5] [no req] Exception from line 357 of /www/translatewiki.net/w/includes/cache/MessageCache.php: Could not acquire 'mediawiki-bw_:messages:en:status' lock.
#0 /www/translatewiki.net/w/includes/cache/MessageCache.php(707): MessageCache->load('en')
#1 /www/translatewiki.net/w/includes/cache/MessageCache.php(647): MessageCache->getMsgFromNamespace('Gadgets-definit...', 'en')
#2 /www/translatewiki.net/w/includes/Message.php(670): MessageCache->get('gadgets-definit...', true, Object(Language))
#3 /www/translatewiki.net/w/includes/Message.php(543): Message->fetchMessage()
#4 /www/translatewiki.net/w/extensions/Gadgets/Gadgets_body.php(580): Message->exists()
#5 /www/translatewiki.net/w/extensions/Gadgets/Gadgets_body.php(38): Gadget::loadStructuredList()
#6 [internal function]: GadgetHooks::userGetDefaultOptions(Array)
#7 /www/translatewiki.net/w/includes/Hooks.php(255): call_user_func_array('GadgetHooks::us...', Array)
#8 /www/translatewiki.net/w/includes/GlobalFunctions.php(3856): Hooks::run('UserGetDefaultO...', Array)
#9 /www/translatewiki.net/w/includes/User.php(1240): wfRunHooks('UserGetDefaultO...', Array)
#10 /www/translatewiki.net/w/includes/User.php(4155): User::getDefaultOptions()
#11 /www/translatewiki.net/w/includes/User.php(2220): User->loadOptions()
#12 /www/translatewiki.net/w/includes/context/RequestContext.php(284): User->getOption('language')
#13 /www/translatewiki.net/w/includes/StubObject.php(183): RequestContext->getLanguage()
#14 /www/translatewiki.net/w/includes/StubObject.php(128): StubUserLang->_newObject()
#15 /www/translatewiki.net/w/includes/StubObject.php(78): StubObject->_unstub('getCode', 5)
#16 /www/translatewiki.net/w/includes/StubObject.php(176): StubObject->_call('getCode', Array)
#17 /www/translatewiki.net/w/includes/cache/MessageCache.php(627): StubUserLang->__call('getCode', Array)
#18 /www/translatewiki.net/w/includes/cache/MessageCache.php(627): StubUserLang->getCode()
#19 /www/translatewiki.net/w/includes/Message.php(670): MessageCache->get('parentheses', true, Object(StubUserLang))
#20 /www/translatewiki.net/w/includes/Message.php(431): Message->fetchMessage()
#21 /www/translatewiki.net/w/includes/Message.php(534): Message->toString()
#22 /www/translatewiki.net/w/includes/specials/SpecialVersion.php(190): Message->escaped()
#23 /www/translatewiki.net/w/extensions/Translate/ffs/GettextFFS.php(533): SpecialVersion::getVersion()
#24 /www/translatewiki.net/w/extensions/Translate/ffs/GettextFFS.php(406): GettextFFS->getGenerator()
#25 /www/translatewiki.net/w/extensions/Translate/ffs/GettextFFS.php(357): GettextFFS->doGettextHeader(Object(MessageCollection), Array, false)
#26 /www/translatewiki.net/w/extensions/Translate/ffs/FFS.php(240): GettextFFS->writeReal(Object(MessageCollection))
#27 /www/translatewiki.net/w/extensions/Translate/scripts/export.php(242): SimpleFFS->write(Object(MessageCollection))
#28 {main}


Version: 1.21.x
Severity: critical
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=5092

Details

Reference
bz43516

Related Objects

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.

Status    Assigned
Open      None
Resolved  tstarling

Event Timeline

bzimport raised the priority of this task from to High.
bzimport set Reference to bz43516.
Bug 45335 has been marked as a duplicate of this bug.

No work has happened on this bug since it was reported two months ago, and this exception is now frequently occurring in WMF production.

I've seen this happen on Wikimedia Labs (piramido.wmflabs.org and toro.wmflabs.org). There was a thread about it at:

http://permalink.gmane.org/gmane.org.wikimedia.labs/756

Andrew Bogott hoped to fix it with https://gerrit.wikimedia.org/r/#/c/47680/, but that didn't work, at least not for toro.

There may be multiple bugs with the same symptoms, but I'm adding this here until pointed to another number.

greg added a comment. Feb 28 2013, 7:40 PM

Setting to Critical as it is affecting WMF infrastructure as well now.

This obviously needs to get fixed properly.

But since it's not mentioned above, this can often be worked around by restarting memcached. e.g.

sudo /etc/init.d/memcached restart

It will recur again though.

Reedy added a comment. Mar 1 2013, 1:10 AM

(In reply to comment #5)

> This obviously needs to get fixed properly.
>
> But since it's not mentioned above, this can often be worked around by restarting memcached, e.g.
>
> sudo /etc/init.d/memcached restart
>
> It will recur again though.

Let me know when you tell ops you've restarted memcached

(In reply to comment #6)

> Let me know when you tell ops you've restarted memcached

I should clarify, that restart command is not intended for WMF production.

It is just a temporary workaround people can use on Labs or their third-party wikis (at their discretion) until the root cause is identified and fixed.

The problem here may be that when the export exits with an exception, locks are kept. In bug 45335, by contrast, the cause is more likely a cache stampede.

I think this message cache lock should be replaced with a PoolCounter. That should fix these issues.
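The lock-kept-on-exception failure mode described above can be modeled outside PHP. The following Python sketch (a dict standing in for memcached, all names illustrative, none of it MediaWiki's actual code) shows why a crashed export leaves the lock key alive until its TTL expires, and how releasing in a finally block avoids that:

```python
import time

# Toy model of the failure mode: a dict stands in for memcached, and the
# key name mirrors the one in the stack dump. Not MediaWiki's actual code.

store = {}  # key -> (value, expiry timestamp)

def add(key, ttl):
    """memcached-style add(): succeeds only if the key is absent or expired."""
    now = time.time()
    if key in store and store[key][1] > now:
        return False
    store[key] = (1, now + ttl)
    return True

def delete(key):
    store.pop(key, None)

def rebuild_with_release(key, rebuild):
    """Acquire the lock, run the rebuild, and always release the lock,
    even if the rebuild (like a failing export script) throws."""
    if not add(key, ttl=60):
        raise RuntimeError("Could not acquire '%s' lock." % key)
    try:
        return rebuild()
    finally:
        delete(key)  # released even when an exception escapes
```

With this shape, the next caller can acquire the lock immediately after a failed rebuild instead of waiting out the TTL.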

(In reply to comment #8)

> The problem here may be that when the export exits with an exception, locks are kept.

If that's the case, then I3066d8db would fix it.

On Wikimedia Labs, restarting memcached as a workaround generally gives an error like:

"Restarting memcached: start-stop-daemon: warning: failed to kill nnnnn: No such process"

where nnnnn is the PID from /var/run/memcached.pid

indicating that the process has somehow died.

This is not fixed, we are logging a lot of these.

This started with I811755d4, i.e. Niklas attempting to get someone to fix a bug by making it more severe. Since that change was deployed to WMF, about 523000 exception messages have been delivered to users. Meanwhile, the two people who collaborated on getting that change live (Niklas and Aaron) have not been monitoring the exception rate and have not fixed the underlying issue. Apparently it's now my problem, which is fine, it just would have made my job a bit easier if the people who deliberately introduced this bug could have told me that they did so.

The message cache expiry is one day. When I tested it at an off-peak time (05:24), it took 3.5 seconds to rebuild the commonswiki/ru cache which was previously the subject of an exception flood. The lock wait loop gives up after 10 seconds, and after those 10 seconds, an exception is thrown as long as it has been less than 60 seconds since the lock acquisition. So, what could the problem possibly be? I'll ask my local rocket scientist.
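The timing rules in that paragraph reduce to a small predicate. This is an illustrative model only; the constant names are mine, and the values (10 s wait, 60 s window) are taken from the comment above:

```python
# Illustrative model of the lock-wait rules described above: a waiter gives
# up after 10 seconds, and throws only if the lock was acquired less than
# 60 seconds ago. Constant names are mine, not MediaWiki's.

LOCK_WAIT = 10   # seconds a waiter spins before giving up
LOCK_TTL = 60    # seconds after acquisition during which late waiters throw

def waiter_throws(seconds_waited, seconds_since_acquisition):
    """True if a thread that failed to get the lock raises the
    'Could not acquire lock' exception."""
    return (seconds_waited >= LOCK_WAIT
            and seconds_since_acquisition < LOCK_TTL)
```

With a 3.5 s rebuild no waiter should ever reach the 10 s timeout, so something must be keeping the lock busy for more than 10 s.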

What you wrote is true, except that nobody told me to monitor nor how to monitor exception rates. Very little information flows to me what is available on wmf production and what is expected.

Sorry, I don't understand the purpose of your second paragraph. It's probably a good thing that I'm not building rockets.

Proposed fix: Ia145fd90

ori added a comment. Apr 4 2013, 12:35 AM

(In reply to comment #13)

> What you wrote is true, except that nobody told me to monitor exception rates, nor how to monitor them.

You can SSH to fluorine.eqiad.wmnet and monitor /a/mw-log/exception.log and /a/mw-log/fatal.log. It is a little obscure.

In the old code, all threads finding the cache empty will wait for the lock, taking turns to run loadFromDB(). They don't recheck the shared cache after the lock is acquired. So if loadFromDB() takes 1 s and there are 100 req/s, there will be 99 threads queued up waiting for the lock when the first one finishes. Each will then run loadFromDB() in turn, for another 9 seconds. Then the lock wait timeout will be reached, so the other 90 threads will fail to add the status key and will hit the "Could not acquire $statusKey" exception simultaneously.

This theory agrees pretty well with the pattern of exception floods we see in the logs.
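The serialization just described, and the fix of rechecking the shared cache once the lock is held, can be sketched like this (a Python stand-in with illustrative names, not MediaWiki's actual code):

```python
import threading

LOCK = threading.Lock()

def load_old(cache, db_loads):
    """Flawed pattern: every thread that found the cache empty
    rebuilds it in turn once it gets the lock."""
    with LOCK:
        db_loads.append(1)          # stands in for a ~1 s loadFromDB()
        cache["messages"] = "data"

def load_fixed(cache, db_loads):
    """Fixed pattern: recheck the shared cache once the lock is held,
    so only the first waiter actually hits the database."""
    with LOCK:
        if "messages" not in cache:
            db_loads.append(1)
            cache["messages"] = "data"
```

Under the old pattern, 100 waiters mean 100 sequential loadFromDB() calls; under the fixed pattern, the 99 late waiters find the cache populated and return immediately.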

(In reply to comment #13)

> What you wrote is true, except that nobody told me to monitor exception rates, nor how to monitor them. Very little information flows to me about what is available in WMF production and what is expected.

I didn't expect you to know those things. I expected Aaron to know them.

(In reply to comment #14)

> Proposed fix: Ia145fd90

Patch got merged yesterday.

Is it already possible to say whether the patch was effective? Can this ticket be closed, or is there something left to do?

I have planned an activity on Saturday that usually triggered this bug repeatedly. I hope I remember to update. In fact, I'll add it to my calendar...

swalling wrote:

(In reply to comment #19)

> Is it already possible to say whether the patch was effective? Can this ticket be closed, or is there something left to do?

I am not seeing these errors on our Labs instances anymore, FWIW.

I don't see these errors anymore on translatewiki.net, where I usually saw them. Thanks for the fix.

We're getting a similar error in response to API requests in our Parsoid RT testing infrastructure. Maybe reopen?

http://parsoid.wmflabs.org:8001/result/d74bd4e8003084de3760ad64e006f426e588a19f/ar/Barbie

Mark: How often does this happen?

Tim: Do you plan to take another look at this now that this ticket got reopened, or shall this be unassigned from you?

> mtraceur: How often does this happen?
>
> Tim: Do you plan to take another look at this now that this ticket got reopened, or shall this be unassigned from you?

Consistently, i.e., I couldn't get it to not happen when testing certain pages.

It looks like that specific page is probably working now, let me see if we have any other instances where there are issues...

No, I guess Parsoid is getting no errors in its RT testing, so I'll close this again.