
Memcached error "SERVER ERROR" from nutcracker
Closed, Declined (Public)

Description

Since 01:00:00 UTC today, the mediawiki-errors and memcached dashboards in logstash have been flooded with the following error:

https://logstash.wikimedia.org/#/dashboard/elasticsearch/memcached

Memcached error for key "commonswiki:pcache:idhash:28455663-0!*!0!!en!*!*" on server "/var/run/nutcracker/nutcracker.sock:0": SERVER ERROR

Several hits per minute, continuous for the past 2 hours.

Most entries come from job runners processing commonswiki, for refreshLinks or cirrusSearchLinksUpdatePrioritized jobs.

May be related:

Event Timeline

Krinkle created this task. Dec 11 2015, 2:17 AM
Krinkle raised the priority of this task to Needs Triage.
Krinkle updated the task description.
Krinkle added a subscriber: Krinkle.
Restricted Application added subscribers: StudiesWorld, Steinsplitter, Aklapper. Dec 11 2015, 2:17 AM
bd808 added a subscriber: bd808. Dec 11 2015, 2:52 AM

For this same key there were also these errors:

  • Memcached error for key "commonswiki:pcache:idhash:28455663-0!*!0!!en!*!*" on server "/var/run/nutcracker/nutcracker.sock:0": ITEM TOO BIG
  • Memcached error for key "commonswiki:pcache:idhash:28455663-0!*!0!!en!*!*" on server "/var/run/nutcracker/nutcracker.sock:0": A TIMEOUT OCCURRED
ori added a subscriber: ori. Mar 22 2016, 10:01 PM

The root cause of this issue is explained in https://github.com/twitter/twemproxy/blob/ef45313/src/nc_response.c#L162-L179.

When a client tries to store a value in memcached that exceeds the item size limit, memcached sends an ITEM TOO BIG error response before it has received the entire request. Nutcracker can use a single server connection to proxy multiple client connections, and this out-of-band response from memcached can confuse nutcracker into sending a server reply to the wrong client connection (see issue #149: server response during request send can cause client request/response mismatch). To prevent that, Nutcracker sends a generic SERVER ERROR in response to every request that has been enqueued for that connection. We see this on the jobrunners because only there is the memcached request parallelism high enough for requests to get queued.
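To make the failure mode above concrete, here is a toy model of one server connection shared by several clients. It is only a sketch of the behaviour described in this comment, not twemproxy's actual code; the class name, method names and the `request_fully_sent` flag are all made up for illustration.

```python
from collections import deque


class SharedServerConnection:
    """Toy model: one server connection multiplexing several clients."""

    def __init__(self):
        # Client ids waiting for a reply, in the order requests were queued.
        self.pending = deque()

    def enqueue(self, client_id):
        self.pending.append(client_id)

    def on_response(self, payload, request_fully_sent=True):
        """Return the (client_id, reply) pairs to deliver for one server reply."""
        if not self.pending:
            return []
        if request_fully_sent:
            # Normal case: replies arrive in request order, so the oldest
            # pending client owns this payload.
            return [(self.pending.popleft(), payload)]
        # Out-of-band case: the reply arrived while a request was still being
        # sent, so its owner is ambiguous. Fail every queued request with a
        # generic error instead of risking a reply to the wrong client.
        failed = [(client_id, "SERVER ERROR") for client_id in self.pending]
        self.pending.clear()
        return failed


conn = SharedServerConnection()
for client in ("a", "b", "c"):
    conn.enqueue(client)

replies = conn.on_response(
    "SERVER_ERROR object too large for cache\r\n", request_fully_sent=False
)
```

In this model only one client sent an oversized value, but all three queued clients receive the generic error, which is why unrelated keys show up in the logs.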

We can increase the number of server connections available to Nutcracker on the jobrunners, but there is no way to prevent this issue systematically, because there is no way for PHP code to know the compressed payload size before it is sent to memcached.
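As a rough illustration of why the uncompressed size tells the caller very little, the sketch below compresses two payloads of identical raw length. It makes two assumptions that are not from this task: `zlib` stands in for whatever compression the memcached client applies, and `ITEM_LIMIT` is memcached's default 1 MiB item size, which may differ from the production setting.

```python
import os
import zlib

# Assumption: memcached's default maximum item size (1 MiB).
ITEM_LIMIT = 1024 * 1024


def compressed_size(raw: bytes) -> int:
    """Size of the payload after compression (zlib used as a stand-in)."""
    return len(zlib.compress(raw))


raw_len = 2 * ITEM_LIMIT

# Two payloads with the same raw length but very different redundancy.
repetitive = b"a" * raw_len
random_like = os.urandom(raw_len)

fits_repetitive = compressed_size(repetitive) <= ITEM_LIMIT
fits_random = compressed_size(random_like) <= ITEM_LIMIT
```

Both values are twice the limit before compression, yet only the repetitive one ends up under it, so a check on the raw length alone would misjudge one of the two.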

You can verify that this is indeed what is happening by looking at nutcracker.log; there is an entry like this for every entry in memcached.log on fluorine:

[2016-03-22 20:25:10.815] nc_response.c:159 filter stray rsp 1618296121 len 41 on s 19
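For anyone who wants to correlate the two logs, a small parser along these lines could pull the fields out of such entries. The regex is written against the single sample line above, and the field names (`msg_id`, `length`, `server_fd`) are my own labels for the three numbers, not names taken from nutcracker.

```python
import re

# Pattern derived from the one sample entry; other nutcracker versions may
# format this message differently.
STRAY_RSP = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+"
    r"(?P<source>\S+):(?P<line>\d+)\s+"
    r"filter stray rsp (?P<msg_id>\d+) len (?P<length>\d+) on s (?P<server_fd>\d+)$"
)


def parse_stray_rsp(entry: str):
    """Return the fields of a 'filter stray rsp' entry, or None if no match."""
    match = STRAY_RSP.match(entry.strip())
    if match is None:
        return None
    return {
        "timestamp": match["timestamp"],
        "msg_id": int(match["msg_id"]),
        "length": int(match["length"]),
        "server_fd": int(match["server_fd"]),
    }


sample = (
    "[2016-03-22 20:25:10.815] nc_response.c:159 "
    "filter stray rsp 1618296121 len 41 on s 19"
)
parsed = parse_stray_rsp(sample)
```

Counting parsed entries per minute and comparing against the SERVER ERROR rate in memcached.log would be one way to check the correspondence.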

ori added a comment. Mar 22 2016, 10:38 PM

It's always 'len 41' btw, which matches the length of "SERVER_ERROR object too large for cache\r\n"
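The length is easy to double-check by counting the bytes of that reply string:

```python
# The reply text quoted above, including the trailing CRLF.
reply = b"SERVER_ERROR object too large for cache\r\n"
reply_len = len(reply)
```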

Krinkle renamed this task from Memcached error for key "commonswiki:pcache:idhash:28455663-0!*!0!!en!*!*" on server "/var/run/nutcracker SERVER ERROR to Memcached error "SERVER ERROR" from nutcracker. Jul 6 2017, 4:41 AM

Still seeing these in the logs.

Below are a few aggregated entries for the most common offenders. The errors don't seem to be tied to any particular group of keys.

SERVER ERROR – Memcached error for key "WANCache:v:<wiki>:page: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "WANCache:v:<wiki>:file: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "WANCache:v:<wiki>:revisiontext:textid: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "WANCache:v:<wiki>:revisiontext:textid: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "WANCache:v:global:revision:<wiki>: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "<wiki>:prepared-edit: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "<wiki>:prepared-edit: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "<wiki>:prepared-edit: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "<wiki>:pcache: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "<wiki>:pcache: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "<wiki>:pcache: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
SERVER ERROR – Memcached error for key "<wiki>:textextracts: .. " on server "/var/run/nutcracker/nutcracker.sock:0"

..
ITEM TOO BIG – Memcached error for key "<wiki>:prepared-edit: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
ITEM TOO BIG – Memcached error for key "<wiki>:prepared-edit: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
ITEM TOO BIG – Memcached error for key "<wiki>:prepared-edit: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
..
ITEM TOO BIG – Memcached error for key "<wiki>:pcache: .. " on server "/var/run/nutcracker/nutcracker.sock:0"
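A grouping like the one above could be produced with something along these lines. This is a sketch only: the message format is taken from the raw entries earlier in this task, while `key_group` and its rule of treating any key component ending in "wiki" as a wiki id are simplifications invented for illustration (not every wiki database name ends that way).

```python
import re
from collections import Counter

# Format taken from the raw log entries quoted in this task.
ENTRY = re.compile(
    r'Memcached error for key "(?P<key>[^"]*)" '
    r'on server "(?P<server>[^"]*)": (?P<error>.+)$'
)


def key_group(key: str) -> str:
    """Reduce a cache key to a coarse group such as '<wiki>:pcache'."""
    parts = key.split(":")
    # WANCache keys carry two extra leading components, so keep four.
    head = parts[:4] if parts[0] == "WANCache" else parts[:2]
    # Simplification: any component ending in "wiki" is treated as a wiki id.
    head = ["<wiki>" if part.endswith("wiki") else part for part in head]
    return ":".join(head)


def aggregate(lines):
    """Count log lines per (error, key group) pair."""
    counts = Counter()
    for line in lines:
        match = ENTRY.search(line)
        if match:
            counts[(match["error"].strip(), key_group(match["key"]))] += 1
    return counts


server = "/var/run/nutcracker/nutcracker.sock:0"
key = "commonswiki:pcache:idhash:28455663-0!*!0!!en!*!*"
sample = [
    f'Memcached error for key "{key}" on server "{server}": SERVER ERROR',
    f'Memcached error for key "{key}" on server "{server}": ITEM TOO BIG',
    f'Memcached error for key "{key}" on server "{server}": SERVER ERROR',
]
counts = aggregate(sample)
```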
bd808 removed a subscriber: bd808. Jul 6 2017, 5:17 AM
Krinkle closed this task as Declined. Feb 22 2018, 8:00 AM

Closing for now in favour of more recent and more specific tasks about memcached/nutcracker issues in wmf-production.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM