labtestweb2001: Memcached error for key on server "127.0.0.1:11213": SERVER HAS FAILED
Closed, Resolved · Public

Description

It seems memcached / nutcracker is dead on labtestweb2001. That eventually causes log spam.

Event Timeline

Restricted Application added a subscriber: Aklapper. Sep 4 2018, 3:50 PM
jcrespo added a subscriber: jcrespo. Sep 4 2018, 4:08 PM

This may be a duplicate of T201082. It may be a different issue, but either way it is part of the brokenness of the labtestweb setup.

Krinkle added a subscriber: Krinkle. Sep 5 2018, 2:23 AM

@jcrespo The error message looks a bit confusing, but it's actually reporting a problem with a memcached server, not a database server. It is reporting that MediaWiki (on labtestweb2001) is unable to access the Memcached key WANCache:m:global:Wikimedia\Rdbms\LoadBalancer:server-read-only:db2037 from 127.0.0.1:11213 (mcrouter).

This seems like a genuine issue, which means one of two things:

  1. a memcached server in codfw that mcrouter routes to is down.
  2. mcrouter itself is down on labtestweb2001.
Krinkle renamed this task from labtestweb2001: Memcached error for key "WANCache:m:global:Wikimedia\Rdbms\LoadBalancer:server-read-only:db2037" on server "127.0.0.1:11213": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY to labtestweb2001: Memcached error for key on server "127.0.0.1:11213": SERVER HAS FAILED. Sep 5 2018, 2:24 AM
Krinkle moved this task from Dec 2019 / 1.35.wmf.10+ to Meta on the Wikimedia-production-error board.

Magically I have access to the machine!

memcached is running and listening on port 11000

There is a process listening on 127.0.0.1:11212, which is supposedly nutcracker. Is that used by openstack-dashboard?

$ grep -R 11212 /etc
/etc/openstack-dashboard/local_settings.py:       'LOCATION' : '127.0.0.1:11212',
/etc/nagios/nrpe.d/check_nutcracker_port.cfg:command[check_nutcracker_port]=/usr/lib/nagios/plugins/check_tcp -H 127.0.0.1 -p 11212 --timeout=2
/etc/nutcracker/nutcracker.yml:  listen: 127.0.0.1:11212

In nutcracker, the memcached bucket listens on port 11212 and points to memcached on 11000.
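
The same port findings could be re-checked directly with a small TCP probe. This is a minimal sketch, not something that was run on the host: the function name `is_listening` is made up for illustration, and the 2 second default timeout simply mirrors the `--timeout=2` of the nrpe check above.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports mentioned in this task: memcached, nutcracker, mcrouter.
    for name, port in [("memcached", 11000), ("nutcracker", 11212), ("mcrouter", 11213)]:
        state = "listening" if is_listening("127.0.0.1", port) else "not listening"
        print(f"{name} 127.0.0.1:{port}: {state}")
```

A successful connect only shows that something accepts connections on the port, not that it speaks the memcached protocol.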

MediaWiki has:

$ mwscript shell.php --wiki=labtestwiki
>>> $wgObjectCaches['memcached-pecl']['servers']
=> [
     "/var/run/nutcracker/nutcracker.sock:0",
   ]

Under HHVM we use a socket instead of port 11212.

Then:

>>> $wgObjectCaches['mcrouter']['servers']
=> [
     "127.0.0.1:11213",
   ]
>>>
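
Putting the configured servers next to the listeners found so far makes the mismatch explicit. The sketch below is only an illustration: `unreachable_servers` is a hypothetical helper, and it assumes that the listeners noted earlier (memcached on 11000, nutcracker on 11212) are the only relevant ones and that memcached is bound to 127.0.0.1, which the task does not state.

```python
def unreachable_servers(configured, listening):
    """Return configured TCP endpoints that have no matching listener.

    Entries starting with "/" are UNIX socket paths and are skipped,
    since they cannot be compared against a set of TCP listeners.
    """
    missing = []
    for server in configured:
        if server.startswith("/"):
            continue
        if server not in listening:
            missing.append(server)
    return missing


# Server lists copied from the shell.php output in this task.
configured = [
    "/var/run/nutcracker/nutcracker.sock:0",  # $wgObjectCaches['memcached-pecl']
    "127.0.0.1:11213",                        # $wgObjectCaches['mcrouter']
]

# Listeners seen on the host (assumption: memcached binds to 127.0.0.1).
listening = {"127.0.0.1:11000", "127.0.0.1:11212"}

print(unreachable_servers(configured, listening))  # ['127.0.0.1:11213']
```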

In puppet:

hieradata/common/mcrouter.yaml:mcrouter::port: 11213
modules/profile/manifests/mediawiki/mcrouter_wancache.pp:    Integer $port = hiera('mcrouter::port'),

labtestweb2001.wikimedia.org has puppet roles:

role(wmcs::openstack::labtest::labweb)
include ::role::mariadb::labtestwikitech

From the wmcs::openstack::labtest::labweb role:

include ::profile::openstack::labtest::nutcracker
# Wikitech:
    include ::profile::openstack::labtest::wikitech::web

So I guess the role should include one of profile::mediawiki::mcrouter_wancache or role::mediawiki::common. I have not looked at how it is handled for the production wikitech site.

Change 458457 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Fix condition for using nutcracker instead of mcrouter on wikitech

https://gerrit.wikimedia.org/r/458457

Joe claimed this task. Sep 6 2018, 8:49 AM
Joe triaged this task as High priority.

The problem is that the labtestweb machines are configured to serve labtestwiki, and we didn't configure that wiki to use the local nutcracker. Instead it was pointed at the global mcrouter, which doesn't make any sense on these hosts.

The patch I uploaded fixes the issue. We should *not* install mcrouter here.
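
To illustrate the kind of condition involved (this is not the content of change 458457, which is PHP in wmf-config/mc.php), the choice can be sketched as a per-wiki lookup. The set `NUTCRACKER_WIKIS` and the function `main_cache_for` are hypothetical names; the returned strings are the `$wgObjectCaches` keys shown earlier in this task.

```python
# Hypothetical set: wikis served by hosts that run a local nutcracker
# and have no mcrouter installed.
NUTCRACKER_WIKIS = {"labtestwiki"}


def main_cache_for(wiki: str) -> str:
    """Return the $wgObjectCaches key a wiki should use as its main cache."""
    if wiki in NUTCRACKER_WIKIS:
        # 'memcached-pecl' points at the local nutcracker socket.
        return "memcached-pecl"
    # Everything else goes through mcrouter on 127.0.0.1:11213.
    return "mcrouter"
```

With this shape, a wiki missing from the set silently falls through to mcrouter, which is roughly the failure mode described above.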

Change 458457 merged by jenkins-bot:
[operations/mediawiki-config@master] Fix condition for using nutcracker instead of mcrouter on wikitech

https://gerrit.wikimedia.org/r/458457

Mentioned in SAL (#wikimedia-operations) [2018-09-06T09:11:15Z] <oblivian@deploy1001> Synchronized wmf-config/mc.php: Fixing memcached configuration for labstestwiki T203479 (duration: 00m 56s)

Joe closed this task as Resolved. Sep 6 2018, 9:16 AM

Indeed. Thank you @Joe

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:09 PM