
Warning: Memcached::setMulti(): failed to set key global:segment:...
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error message
ErrorException from line 340 of /srv/mediawiki/php-1.35.0-wmf.23/includes/libs/objectcache/MemcachedPeclBagOStuff.php: PHP Warning: Memcached::setMulti(): failed to set key global:segment:enwiki%3Apcache%3Aidhash%3A23309859-0!canonical:ce9eb2174b45be4c0a2966f5bfbf5f045e3b6388
Impact
Notes

Quickly got 40+ of these warnings in logspam-watch after rolling 1.35.0-wmf.23 to all wikis; the numbers seemed to be increasing steadily.

Details

Request ID
XmqX4ApAMM0AAKPJ0lgAAAAB
Request URL
https://en.wikipedia.org/w/api.php/w/api.php
Stack Trace
2020-03-12 20:14:26 [XmqX4ApAMM0AAKPJ0lgAAAAB] mw1379 enwiki 1.35.0-wmf.23 error ERROR: [XmqX4ApAMM0AAKPJ0lgAAAAB] /w/api.php   ErrorException from line 340 of /srv/mediawiki/php-1.35.0-wmf.23/includes/libs/objectcache/MemcachedPeclBagOStuff.php: PHP Warning: Memcached::setMulti(): failed to set key global:segment:enwiki%3Apcache%3Aidhash%3A23309859-0!canonical:ce9eb2174b45be4c0a2966f5bfbf5f045e3b6388 {"exception_id":"XmqX4ApAMM0AAKPJ0lgAAAAB","exception_url":"/w/api.php","caught_by":"mwe_handler"} 
[Exception ErrorException] (/srv/mediawiki/php-1.35.0-wmf.23/includes/libs/objectcache/MemcachedPeclBagOStuff.php:340) PHP Warning: Memcached::setMulti(): failed to set key global:segment:enwiki%3Apcache%3Aidhash%3A23309859-0!canonical:ce9eb2174b45be4c0a2966f5bfbf5f045e3b6388
  #0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
  #1 /srv/mediawiki/php-1.35.0-wmf.23/includes/libs/objectcache/MemcachedPeclBagOStuff.php(340): Memcached->setMulti(array, integer)
  #2 /srv/mediawiki/php-1.35.0-wmf.23/includes/libs/objectcache/MediumSpecificBagOStuff.php(608): MemcachedPeclBagOStuff->doSetMulti(array, integer, integer)
  #3 /srv/mediawiki/php-1.35.0-wmf.23/includes/libs/objectcache/MediumSpecificBagOStuff.php(805): MediumSpecificBagOStuff->setMulti(array, integer, integer)
  #4 /srv/mediawiki/php-1.35.0-wmf.23/includes/libs/objectcache/MediumSpecificBagOStuff.php(166): MediumSpecificBagOStuff->makeValueOrSegmentList(string, ParserOutput, integer, integer)
  #5 /srv/mediawiki/php-1.35.0-wmf.23/includes/libs/objectcache/MultiWriteBagOStuff.php(327): MediumSpecificBagOStuff->set(string, ParserOutput, integer, integer)
  #6 /srv/mediawiki/php-1.35.0-wmf.23/includes/libs/objectcache/MultiWriteBagOStuff.php(150): MultiWriteBagOStuff->doWrite(array, boolean, string, array)
  #7 /srv/mediawiki/php-1.35.0-wmf.23/includes/parser/ParserCache.php(354): MultiWriteBagOStuff->set(string, ParserOutput, integer, integer)
  #8 /srv/mediawiki/php-1.35.0-wmf.23/includes/poolcounter/PoolWorkArticleView.php(213): ParserCache->save(ParserOutput, WikiPage, ParserOptions, string, integer)
  #9 /srv/mediawiki/php-1.35.0-wmf.23/includes/poolcounter/PoolCounterWork.php(125): PoolWorkArticleView->doWork()
  #10 /srv/mediawiki/php-1.35.0-wmf.23/includes/page/WikiPage.php(1236): PoolCounterWork->execute()
  #11 /srv/mediawiki/php-1.35.0-wmf.23/includes/api/ApiParse.php(603): WikiPage->getParserOutput(ParserOptions, integer, boolean)
  #12 /srv/mediawiki/php-1.35.0-wmf.23/includes/api/ApiParse.php(166): ApiParse->getParsedContent(WikiPage, ParserOptions, boolean, NULL, NULL, boolean)
  #13 /srv/mediawiki/php-1.35.0-wmf.23/includes/api/ApiMain.php(1590): ApiParse->execute()
  #14 /srv/mediawiki/php-1.35.0-wmf.23/includes/api/ApiMain.php(522): ApiMain->executeAction()
  #15 /srv/mediawiki/php-1.35.0-wmf.23/includes/api/ApiMain.php(493): ApiMain->executeActionWithErrorHandling()
  #16 /srv/mediawiki/php-1.35.0-wmf.23/api.php(84): ApiMain->execute()
  #17 /srv/mediawiki/w/api.php(3): require(string)
  #18 {main}

Related Objects

Status     Subtype            Assigned   Task
Resolved   Release            hashar
Resolved   Release            brennen
Resolved   PRODUCTION ERROR   Krinkle

Event Timeline

Krinkle moved this task from General to libs/objectcache on the MediaWiki-libs-BagOStuff board.
Krinkle subscribed.

It is normal for Memcached to sometimes time out or fail (we prefer fast failure, and tolerate errors there). As such, those are configured to go to their own channel (channel:memcached) that we don't monitor. It returns false to the caller like a cache miss and recovers fine.
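For illustration, a minimal sketch of that contract (the key and the compute helper below are made up for this example, not taken from this task): a backend failure surfaces to the caller exactly like a miss, so the caller simply recomputes.

$cache = ObjectCache::getLocalClusterInstance();
$key = $cache->makeKey( 'example-thing', 42 ); // hypothetical key

$value = $cache->get( $key );
if ( $value === false ) {
	// A cache miss and a Memcached error look identical to the caller.
	$value = computeExampleThing( 42 ); // hypothetical expensive computation
	$cache->set( $key, $value, BagOStuff::TTL_HOUR ); // a failed set is tolerated too
}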

This, however, is the first time I have seen php-memcached emit native PHP errors. That indeed shouldn't happen and is likely a regression.

Which of the action APIs is causing this? Is it even a specific one, or all of them?

> It is normal for Memcached to sometimes time out or fail (we prefer fast failure, and tolerate errors there). As such, those are configured to go to their own channel (channel:memcached) that we don't monitor. It returns false to the caller like a cache miss and recovers fine.

FWIW, that's a problematic design that causes problems any time memcached is used for anything other than caching. E.g., IIRC, it caused T106066: Don't show "Nonce already used" error on memcache failure.

@Tgr That's a fundamental contract between MW and Memcached that I don't think we will depart from anytime soon. Non-cache use cases do not belong in Memcached. The use case of T106066 is what session store, echo store, main stash, db-replicated, or dedicated tables are for.

I'm trying to figure out if this is an actual code change or a memcached server problem with bad timing. Or maybe it's a code change that causes memcached to have a bad time? Does anyone have a theory there?

I see a lot of timeouts from the memcached server at that time. I also see requests not just to the API, but to other endpoints; e.g., wiki/Star?render threw an error during this period.

Also, the changelog entries that mention "cache" are... numerous: https://www.mediawiki.org/wiki/MediaWiki_1.35/wmf.23/Changelog

Given that this was happening across many code paths I guess I'm looking for a new use that overwhelmed the backend, or some fundamental change in the handling code (which didn't obviously [at least to me] happen in wmf.23).

There was a spike in network transmit rate on mc1029 and mc1030 at the time in question. The other servers did not show such a spike. So it may be a small number of keys with large values being frequently requested.

mc net tx spike.png (277×756 px, 24 KB)

I used admin requests to verify that out of 10 selected log entries from 20:10:36, all 10 were associated with mc1030, consistent with saturation of its outbound networking at the time.

I tried pulling out the top 100 keys by error rate from that second, but nothing really stood out. So I tried multiplying the request count of those keys by their current size in memcached, but it was still uninspiring.

I noticed that Redis is also running on mc1030. There's no visibility into what nutcracker/Redis is doing: if nutcracker times out while talking to Redis, no error is logged; it silently fails over to the next server. Statistics are incremented, and they seem to show that nutcracker was not dropping mc1030 at the time:

[0300][tstarling@mw1242:~]$ nc localhost 22222 | php -r '$s = stream_get_contents(STDIN); print str_replace("\"\"", "\"", $s);' | jq .redis_eqiad.shard12
{
  "server_eof": 0,
  "server_err": 0,
  "server_timedout": 4,
  "server_connections": 1,
  "server_ejected_at": 1569023964391925,
  "requests": 164938413,
  "request_bytes": 15263903696,
  "responses": 164938409,
  "response_bytes": 19860256544,
  "in_queue": 0,
  "in_queue_bytes": 0,
  "out_queue": 0,
  "out_queue_bytes": 0
}

This output shows that mc1030 was last ejected on 2019-09-20, but maybe it just has a very long timeout. It's hard to say whether Redis was responsible for the network spike when we don't have a traffic breakdown of Redis versus Memcached on mc1030.

We do have detailed statistics for the memcached instance, which show that the traffic originated from memcached, not Redis.

mc1030 mc network.png (285×773 px, 18 KB)

We can even see which slab the key was in:

mc1030 slabs.png (280×516 px, 22 KB)

That's slab 136 with a chunk size of 83KB.

I looked at the logged keys again. The top key from 20:13 by request rate was WANCache:t:arwiki:gadgets-definition:9:2 (by a small margin). This key appeared very rarely in log messages from another day. It's just a timestamp key, but it's fetched using getWithSetCallback() with a check key, which means that the timestamp key and the value key would have been requested together with getMulti(). The value would have been 82658 bytes, consistent with the chunk size above.
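As a rough sketch of that access pattern (this is not the actual Gadgets extension code; the key components and the callback are illustrative), a getWithSetCallback() call with a check key looks roughly like this, and the WAN cache fetches the value key and the WANCache:t:... timestamp together in one getMulti():

$cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();

$definitions = $cache->getWithSetCallback(
	// Value key (the large ~80 KB blob), e.g. "...:gadgets-definition:9:2"
	$cache->makeKey( 'gadgets-definition', 9, 2 ),
	WANObjectCache::TTL_DAY,
	function () {
		// Recompute the gadget definition blob on a miss (hypothetical helper)
		return Gadget::loadDefinitionsFromDatabase();
	},
	[
		// Check key: a small WANCache:t:... timestamp fetched alongside the
		// value; bumping it via touchCheckKey() invalidates the cached value
		'checkKeys' => [ $cache->makeKey( 'gadgets-definition' ) ],
	]
);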

The remaining mysteries are:

  • Why don't we see log messages of the form "Memcached error: $msg", without the key in the message? If getMulti() is at fault, then you would expect MemcachedPeclBagOStuff::doGetMulti() to call checkResult() with $key = false.
  • Why would there be a spike in traffic for arwiki:gadgets-definition, when the extension code has not changed?
  • Did something go wrong with the anti-stampede and APC local caching code that is supposed to prevent a spike in requests to remote gadget definition cache keys?
  • The value key should have been routed to mc1031, so why would there be a spike in 83KB requests to mc1030?

EDIT: With gzip the key would have been 11577 bytes, too small for the chunk size of 83KB.

If we want to verify which key causes those errors, we need to coordinate between SRE and release engineering and do a partial deployment. Please reach out to serviceops so that we can schedule a test.

Any thoughts on this causing yesterday's parsercache overload at T247788: Parsercache sudden increase of connections? Even if the timestamps don't match?

WDoranWMF subscribed.

I'm going to move this to our watching state; please push it back to our inbox if we can do more to help, or just tell me.

The key in question is enwiki:messages:en, which upon release started being requested up to 10k times per second. Given the size of the key - 82788 bytes - this of course caused congestion on the poor server that had to serve it, specifically mc1030.

I suppose we have to look at whatever patch introduced this caching and revert it for now, and think of alternative approaches.

That key seems to include UI messages, which I thought were originally kept in php arrays?

> That key seems to include UI messages, which I thought were originally kept in php arrays?

You are thinking of the l10n cache, which is a cache over the JSON files. The message cache (*:messages:*) is a cache of message overrides in the MediaWiki namespace, per wiki and per language. Both cache similar things, but from a different source.

> The key in question is enwiki:messages:en, which upon release started being requested up to 10k times per second. Given the size of the key - 82788 bytes - this of course caused congestion on the poor server that had to serve it, specifically mc1030.
>
> I suppose we have to look at whatever patch introduced this caching and revert it for now, and think of alternative approaches.

*Thinking out loud*
83 KB is not too big, but when lots of nodes start requesting it at the same time, it causes network congestion. I think the easiest way to fix it is to have a copy of this in APCu. What do you think?

Would that mean we have a huge message somewhere under https://en.wikipedia.org/wiki/MediaWiki:XXX ?

The largest messages on en.wikipedia.org (just scrolling through Special:AllMessages) are:

Of these, the Spam-blacklist has been changed actively in the past few days and is presently 93,993 bytes.

So, these keys are stored in and read from MessageCache. It has a multi-level cache: a local-server cache, a cluster cache, and a WAN cache.
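A rough sketch of that lookup order (this is not MessageCache's actual code; loadFromDatabase() is a hypothetical stand-in, and the WAN-cache hash validation is omitted): check the per-server cache first, then the cluster cache, then fall back to the database.

function loadMessages( BagOStuff $srvCache, BagOStuff $clusterCache, $code ) {
	// Each cache builds its own key; with a per-wiki keyspace both resolve
	// to something like "enwiki:messages:en".
	$srvKey = $srvCache->makeKey( 'messages', $code );
	$clusterKey = $clusterCache->makeKey( 'messages', $code );

	$blob = $srvCache->get( $srvKey ); // 1. local server (APCu)
	if ( $blob === false ) {
		$blob = $clusterCache->get( $clusterKey ); // 2. cluster cache (Memcached)
		if ( $blob === false ) {
			$blob = loadFromDatabase( $code ); // 3. MediaWiki-namespace overrides (hypothetical)
			$clusterCache->set( $clusterKey, $blob );
		}
		$srvCache->set( $srvKey, $blob );
	}
	return $blob;
}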

> I think the easiest way to fix it is to have a copy of this in APCu. What do you think?

From reading the code, this should be done already.

Comparing MessageCache itself between wmf.22 and wmf.23, there don't seem to be any significant changes. Poking at other changes, it seems like https://gerrit.wikimedia.org/r/c/mediawiki/core/+/576858 landed in .23, which changes how the local server cache injected into MessageCache gets initialized. Looking at that change I can't see anything that might suggest a problem, so perhaps this is the wrong path, but I'm still adding it here as a data point.

FYI: The message cache stores the content of all messages together under one key, except those larger than $wgMaxMsgCacheEntrySize, which is 10000 bytes by default but 1024 bytes in WMF production. So this seems to be just a lot of small and medium-sized messages, not super large ones. Though, I am not sure whether a change in the large ones causes cache invalidation for the main key.
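For reference, a minimal sketch of that threshold as it would appear in a site configuration (the 1024-byte value is per the comment above; $wgMaxMsgCacheEntrySize is the real MediaWiki setting):

# LocalSettings.php
# Messages whose content exceeds this many bytes are cached individually
# instead of inside the big per-language *:messages:* blob.
# MediaWiki's default is 10000; WMF production lowers it to 1024.
$wgMaxMsgCacheEntrySize = 1024;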

Apparently with wmf.23 the key enwiki:messages:en kept being fetched from Memcached. If some messages invalidate it, I would expect the key to be fetched a few times to repopulate the local caches, but there should not be much more activity besides that.

I went onto the Beta-Cluster-Infrastructure, which has the full debug logs. Logstash is broken there, but we still have the raw udp2log messages on deployment-fluorine02.deployment-prep.eqiad.wmnet.

Looking at the MessageCache log bucket:

/srv/mw-log/MessageCache.log
2020-03-17 21:10:09 [XnE8sawQBHcAAB@hI4QAAAAW] deployment-mediawiki-07 enwiki 1.35.0-alpha MessageCache DEBUG: MessageCache using store APCUBagOStuff {"class":"APCUBagOStuff"} 
2020-03-17 21:10:09 [XnE8sawQBHcAAB@hI4QAAAAW] deployment-mediawiki-07 enwiki 1.35.0-alpha MessageCache DEBUG: MessageCache::load: Loading en... local cache has the wrong hash, got from global cache  
2020-03-17 21:10:09 [XnE8sawQBHcAAB@hI4UAAAAJ] deployment-mediawiki-07 enwiki 1.35.0-alpha MessageCache DEBUG: MessageCache using store APCUBagOStuff {"class":"APCUBagOStuff"} 
2020-03-17 21:10:09 [XnE8sawQBHcAAB@hI4UAAAAJ] deployment-mediawiki-07 enwiki 1.35.0-alpha MessageCache DEBUG: MessageCache::load: Loading en... local cache has the wrong hash, got from global cache  
...

The debug info is built in MessageCache from the language code and an indication of the source of the data. The breakdown for en:

$ grep 'Loading en\.\.\.' MessageCache.log |cut -d: -f4-|sort|uniq -c |sort -nr
  49911  MessageCache::load: Loading en... local cache has the wrong hash, got from global cache  
   4603  MessageCache::load: Loading en... got from local cache  
    913  MessageCache::load: Loading en... local cache is empty, global cache is empty, loading from database  
     13  MessageCache::load: Loading en... local cache is empty, got from global cache  
      1  MessageCache::load: Loading en... local cache validation key is expired/volatile, global cache is presumed expired, loading from database  
      1  MessageCache::load: Loading en... local cache is expired, global cache is expired, loading from database  
      1  MessageCache::load: Loading en... local cache is empty, global cache is expired/volatile, loading from database  

We have logs going back to 2020-03-02. I would expect "got from local cache" to always be the most frequent entry. That is the case, by a huge margin, until the logfile MessageCache.log-20200306.gz.

The first batches of entries for "local cache has the wrong hash, got from global cache" are from around 2020-03-05 9:00 UTC, which might match a commit that got deployed on beta at that point and would have ended up in wmf.23.

I am not sure whether that can help :/

The "MessageCache::load: Loading en... local cache has the wrong hash, got from global cache" entries do show up on mwdebug1001 with verbose logging (they can then be retrieved in Logstash). So I get either messages from the local cache, or ones with a wrong hash that are fetched from the global cache.

I have no idea why the cache varies. So that is all I have for today.

Mentioned in SAL (#wikimedia-operations) [2020-03-18T00:31:40Z] <Amir1> foreachwikiindblist medium deleteEqualMessages.php --delete (T247562)

> Mentioned in SAL (#wikimedia-operations) [2020-03-18T00:31:40Z] <Amir1> foreachwikiindblist medium deleteEqualMessages.php --delete (T247562)

I'm deleting the values that are no longer needed from that cache, which should help with the size of that cache value. There's nothing for enwiki though :/

This should be put in Puppet and run regularly.

> [On beta], the first batches of entries for "local cache has the wrong hash, got from global cache" are from around 2020-03-05 9:00 UTC. […]

I looked at the l10n-related commits, but nothing stands out there. Maybe some invisible threshold we hit, but I don't see it.

Our Jenkins logs for beta deployments from March 5th have already expired, so I couldn't figure out whether Beta deploys were broken (as they often are) in order to know what the last automatic deploys before 9 AM were. I looked through the previous day of core commits and found this one from myself:

I looked through it many times but was unable to find any difference in behaviour.

Before
# DefaultSettings
$wgObjectCaches = [
	'apcu' => [ 'class' => APCUBagOStuff::class, 'reportDupes' => false ],

];
# ObjectCache.php
public static function detectLocalServerCache() {
  /* … */ return 'apcu';
}
After
public static function makeLocalServerCache() {
  $params = [ 'reportDupes' => false ];
  return new APCUBagOStuff( $params );
}

Looks the same, right? I dumped both from a test script on deployment-mediawiki-07 in the Beta Cluster, and accessed it over HTTP:

w/krinkle.php
<?php
define( 'MW_NO_SESSION', 1 );
require_once __DIR__ . '/../multiversion/MWMultiVersion.php';
require MWMultiVersion::getMediaWiki( 'includes/WebStart.php' );
header( 'Content-Type: text/plain; charset=utf-8' );

var_dump(ObjectCache::detectLocalServerCache());
var_dump(ObjectCache::getInstance(ObjectCache::detectLocalServerCache()));
echo "\n\n";
var_dump(ObjectCache::makeLocalServerCache());
Output
string(4) "apcu"
object(APCUBagOStuff)#310 (17) {
  ["nativeSerialize":"APCUBagOStuff":private]=>
  bool(true)
  ["keyspace":protected]=>
  string(6) "enwiki"
  ["reportDupes":"MediumSpecificBagOStuff":private]=>
  bool(false)
  ["logger":protected]=>
  object(Monolog\Logger)#346 (5) {…}
  ["asyncHandler":protected]=>
  array(2) {
    [0]=>
    string(15) "DeferredUpdates"
    [1]=>
    string(17) "addCallableUpdate"
  }
}

object(APCUBagOStuff)#309 (17) {
  ["nativeSerialize":"APCUBagOStuff":private]=>
  bool(true)
  ["keyspace":protected]=>
  string(5) "local"               # <!-- BOOM
  ["reportDupes":"MediumSpecificBagOStuff":private]=>
  bool(false)
  ["logger":protected]=>
  object(Psr\Log\NullLogger)#308 (0) {
  }
  ["asyncHandler":protected]=>
  NULL
}

The main difference is that the keyspace attribute is no longer set. The keyspace is what differentiates wikis from each other when calling the non-shared makeKey() method for cache keys.

This array key isn't anywhere in the Before code above, because it was automagically injected by the ObjectCache::newFromParams factory.
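To illustrate the effect (the values here mirror the var_dump output above; this snippet is an example, not code from MediaWiki itself): makeKey() prefixes every key with the instance's keyspace, so without the injected per-wiki keyspace every wiki ends up sharing the same 'local:'-prefixed APCu keys.

$withKeyspace = new APCUBagOStuff( [ 'keyspace' => 'enwiki', 'reportDupes' => false ] );
$withoutKeyspace = new APCUBagOStuff( [ 'reportDupes' => false ] ); // keyspace defaults to 'local'

echo $withKeyspace->makeKey( 'messages', 'en' ), "\n";    // enwiki:messages:en
echo $withoutKeyspace->makeKey( 'messages', 'en' ), "\n"; // local:messages:en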

Change 580577 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] objectcache: Restore 'keyspace' for LocalServerCache service

https://gerrit.wikimedia.org/r/580577

Change 580577 merged by jenkins-bot:
[mediawiki/core@master] objectcache: Restore 'keyspace' for LocalServerCache service

https://gerrit.wikimedia.org/r/580577

Change 580598 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.35.0-wmf.24] objectcache: Restore 'keyspace' for LocalServerCache service

https://gerrit.wikimedia.org/r/580598

Change 580599 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.35.0-wmf.23] objectcache: Restore 'keyspace' for LocalServerCache service

https://gerrit.wikimedia.org/r/580599

Those patches should do it, though I am unable to deploy them this morning (kids are homeschooling). At best I will be around at ~1 PM UTC (4 hours from now).

If one has the bandwidth to deploy them and verify, please be bold.

Change 580599 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.23] objectcache: Restore 'keyspace' for LocalServerCache service

https://gerrit.wikimedia.org/r/580599

Change 580598 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.24] objectcache: Restore 'keyspace' for LocalServerCache service

https://gerrit.wikimedia.org/r/580598

I am going to get them deployed.

Mentioned in SAL (#wikimedia-operations) [2020-03-18T11:57:54Z] <hashar@deploy1001> Synchronized php-1.35.0-wmf.23/includes/objectcache/ObjectCache.php: objectcache: Restore keyspace for LocalServerCache service - T247562 (duration: 01m 10s)

Mentioned in SAL (#wikimedia-operations) [2020-03-18T11:59:09Z] <hashar@deploy1001> Synchronized php-1.35.0-wmf.24/includes/objectcache/ObjectCache.php: objectcache: Restore keyspace for LocalServerCache service - T247562 (duration: 01m 07s)

hashar assigned this task to Krinkle.

I have deployed the hotfixes. The reproducible case I had is fixed (mwdebug1001, browse www.mediawiki.org then meta.wikimedia.org).

Thanks to @tstarling for the memcached spike and size/slab investigation (I have learned about slabs), and to @Krinkle for figuring out the source of the cache variance. And in general, thank you to everyone who commented on this task and helped with the investigation one way or another \o/