
deployment-mediawiki02 lost memcached access at 11:51am UTC
Closed, Resolved (Public)

Description

T127964 reports that users cannot log in on the beta Commons wiki. Looking at logstash, I noticed a surge of messages of the following types since 11:51am UTC:

Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
Memcached error for key "{memcached-key}" on server "{memcached-server}": CONNECTION FAILURE
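
To see which failure dominates the surge, the two message shapes above can be tallied per server and reason. This is only a sketch: the function name `tally_reasons` and the sample key and server values are made up for illustration, and it assumes the log lines have already been exported from logstash as plain text.

```python
import re
from collections import Counter

# Matches the message shape quoted above; key and server are whatever
# appears between the double quotes.
PATTERN = re.compile(
    r'Memcached error for key "(?P<key>[^"]*)" '
    r'on server "(?P<server>[^"]*)": (?P<reason>.+)'
)

def tally_reasons(lines):
    """Count messages per (server, reason) pair; non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("server"), match.group("reason").strip())] += 1
    return counts

# Hypothetical sample lines (key and server values are invented).
SAMPLE_LINES = [
    'Memcached error for key "k1" on server "127.0.0.1:11212": CONNECTION FAILURE',
    'Memcached error for key "k2" on server "127.0.0.1:11212": CONNECTION FAILURE',
    'Memcached error for key "k3" on server "127.0.0.1:11212": '
    'SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY',
    'unrelated log line',
]

if __name__ == "__main__":
    for (server, reason), n in tally_reasons(SAMPLE_LINES).most_common():
        print(n, server, reason)
```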

The puppet run on deployment-mediawiki02 at that time changed /etc/nutcracker/nutcracker.yml:

Info: Applying configuration version '1456314632'
Notice: /Stage[main]/Nutcracker/File[/etc/nutcracker/nutcracker.yml]/content: 
--- /etc/nutcracker/nutcracker.yml      2015-10-08 00:50:36.322423911 +0000
+++ /tmp/puppet-file20160224-24403-qa3hxp       2016-02-24 11:50:56.379159631 +0000
@@ -35,8 +35,6 @@
   server_failure_limit: 3
   server_retry_timeout: 30000
   servers:
-    - 10.68.16.177:6379:1
-    - 10.68.16.231:6379:1
   timeout: 1000
 redis_eqiad:
   auto_eject_hosts: true
@@ -49,6 +47,6 @@
   server_failure_limit: 3
   server_retry_timeout: 30000
   servers:
-    - 10.68.16.177:6379:1
-    - 10.68.16.231:6379:1
+    - 10.68.16.177:6379:1 "shard01"
+    - 10.68.16.231:6379:1 "shard02"
   timeout: 1000

Event Timeline

A puppet run on deployment-mediawiki02:

Notice: /Stage[main]/Nutcracker/Service[nutcracker]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Nutcracker/Service[nutcracker]: Unscheduling refresh on Service[nutcracker]

In other words, the Nutcracker proxy to the memcached servers is refusing to start. /var/log/upstart/nutcracker.log has:

nutcracker: configuration file '/etc/nutcracker/nutcracker.yml' syntax is invalid

The redis_codfw pool lacks a list of servers, which might be the cause:

redis_codfw:
  ...
  servers:
  timeout: 1000
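
If an empty `servers:` key is indeed what nutcracker rejects (still only a suspicion at this point), a generated file could be screened for such pools before the service is restarted. The sketch below is a deliberately naive, indentation-based scan using only the standard library, not a YAML parser. The helper names `parse_pools` and `pools_without_servers` are hypothetical, and the sample is an abbreviated copy of the two redis pools.

```python
def parse_pools(config_text):
    """Map each top-level pool name to its list of server entries.

    A pool with no `servers:` key maps to None; a pool whose `servers:`
    key has no list items maps to an empty list.
    """
    pools = {}
    pool = None
    in_servers = False  # True while reading the items under `servers:`
    for raw in config_text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(raw) - len(raw.lstrip())
        if indent == 0 and stripped.endswith(":"):
            pool = stripped[:-1]
            pools[pool] = None
            in_servers = False
        elif pool is not None and stripped == "servers:":
            pools[pool] = []
            in_servers = True
        elif in_servers and stripped.startswith("- "):
            pools[pool].append(stripped[2:])
        else:
            # Any other key ends the servers list.
            in_servers = False
    return pools

def pools_without_servers(config_text):
    """Return the names of pools that have no server entries."""
    return [name for name, servers in parse_pools(config_text).items() if not servers]

# Abbreviated sample modelled on the generated file.
SAMPLE = """\
redis_codfw:
  listen: /var/run/nutcracker/redis_codfw.sock 0666
  redis: true
  servers:
  timeout: 1000
redis_eqiad:
  listen: /var/run/nutcracker/redis_eqiad.sock 0666
  redis: true
  servers:
    - 10.68.16.177:6379:1 "shard01"
    - 10.68.16.231:6379:1 "shard02"
  timeout: 1000
"""

if __name__ == "__main__":
    print(pools_without_servers(SAMPLE))
```

On the sample this flags only redis_codfw. nutcracker's own `--test-conf` option remains the authoritative syntax check; this scan only points at the pool suspected here.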

This might be due to the fix for T127845 (deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb), https://gerrit.wikimedia.org/r/272956, which removed codfw:

https://gerrit.wikimedia.org/r/#/c/272956/2/hieradata/labs/deployment-prep/common.yaml,cm

/etc/nutcracker/nutcracker.yml on deployment-mediawiki02:

mc-unix:
  auto_eject_hosts: true
  distribution: ketama
  hash: md5
  listen: /var/run/nutcracker/nutcracker.sock 0666
  preconnect: true
  server_connections: 1
  server_failure_limit: 3
  server_retry_timeout: 30000
  servers:
    - 10.68.16.14:11211:1
    - 10.68.16.15:11211:1
  timeout: 250
memcached:
  auto_eject_hosts: true
  distribution: ketama
  hash: md5
  listen: 127.0.0.1:11212
  preconnect: true
  server_connections: 1
  server_failure_limit: 3
  server_retry_timeout: 30000
  servers:
    - 10.68.16.14:11211:1
    - 10.68.16.15:11211:1
  timeout: 250
redis_codfw:
  auto_eject_hosts: true
  distribution: ketama
  hash: md5
  listen: /var/run/nutcracker/redis_codfw.sock 0666
  redis: true
  redis_auth: XXXXXXXXXXX
  server_connections: 1
  server_failure_limit: 3
  server_retry_timeout: 30000
  servers:
  timeout: 1000
redis_eqiad:
  auto_eject_hosts: true
  distribution: ketama
  hash: md5
  listen: /var/run/nutcracker/redis_eqiad.sock 0666
  redis: true
  redis_auth: XXXXXXXXXXX
  server_connections: 1
  server_failure_limit: 3
  server_retry_timeout: 30000
  servers:
    - 10.68.16.177:6379:1 "shard01"
    - 10.68.16.231:6379:1 "shard02"
  timeout: 1000