nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures)
Closed, Resolved · Public

Description

Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY

deployment-videoscaler01:~# systemctl list-units --state=failed
  UNIT               LOAD   ACTIVE SUB    DESCRIPTION                             
● ferm.service       loaded failed failed ferm firewall configuration             
● nutcracker.service loaded failed failed nutcracker proxy for memcached and Redis
● puppet.service     loaded failed failed Puppet agent                            
● rc-local.service   loaded failed failed /etc/rc.local Compatibility
hashar created this task. Oct 18 2017, 8:21 AM

ferm is a different issue due to AAAA DNS resolution which is not available on labs. That is T176314 and worked around via https://gerrit.wikimedia.org/r/#/c/381073/

/var/log/nutcracker/nutcracker.log
[2017-10-18 08:28:35.263] nc.c:184 nutcracker-0.4.1 built for Linux 4.9.0-3-amd64 x86_64 started on pid 4586
[2017-10-18 08:28:35.264] nc.c:189 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / don't sit down / it's time to dig another one
[2017-10-18 08:28:35.272] nc_proxy.c:148 bind on p 11 to addr '/var/run/nutcracker/nutcracker.sock 0666' failed: No such file or directory
[2017-10-18 08:28:35.272] nc.c:195 done, rabbit done

/var/run/nutcracker/nutcracker.sock 0666 failed: No such file or directory

/etc/nutcracker/nutcracker.yml
mc-unix:
  listen: /var/run/nutcracker/nutcracker.sock 0666
redis_eqiad:
  listen: /var/run/nutcracker/redis_eqiad.sock 0666

I guess nutcracker is configured to listen on sockets under that path, but nothing creates the /var/run/nutcracker directory before the systemd unit starts, so the bind fails.
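To confirm that guess on another host, a small check along these lines may help. It is only a sketch: `missing_listen_dirs` is a hypothetical helper name, and it assumes the config uses the same `listen: /path/to.sock 0666` layout as the excerpt above. It prints each socket parent directory that does not exist.

```shell
#!/bin/sh
# Hypothetical helper: list parent directories of unix-socket "listen:"
# paths in a nutcracker config that are missing on this host.
missing_listen_dirs() {
    conf="$1"
    # Keep only listen values starting with "/" (unix sockets, not host:port).
    sed -n 's|^[[:space:]]*listen:[[:space:]]*\(/[^[:space:]]*\).*$|\1|p' "$conf" |
    while read -r sock; do
        dir=$(dirname "$sock")
        # Report the directory only when it does not exist.
        [ -d "$dir" ] || echo "$dir"
    done | sort -u
}

# Usage on an affected host (path taken from this task):
#   missing_listen_dirs /etc/nutcracker/nutcracker.yml
```

An empty result would mean every socket directory is present and the failure has another cause.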

Mentioned in SAL (#wikimedia-releng) [2017-10-18T08:38:20Z] <hashar> deployment-videoscaler01: install --owner=nutcracker -d /var/run/nutcracker && systemctl start nutcracker # T178457

hashar renamed this task from deployment-videoscaler01 has memcached failures to nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures). Oct 18 2017, 8:39 AM
hashar added a project: Operations.

Fixed it by MANUALLY creating a /var/run/nutcracker directory.

Mentioned in SAL (#wikimedia-releng) [2017-10-18T08:41:22Z] <hashar> deployment-mediawiki07: install --owner=nutcracker -d /var/run/nutcracker && systemctl start nutcracker # T178457

faidon added a subscriber: faidon. Oct 18 2017, 12:17 PM

nutcracker ships /usr/lib/tmpfiles.d/nutcracker.conf, which should be creating the directory in (/var)/run. This has been working fine in production for months now. Not sure why it doesn't work in your case; could you troubleshoot a little more and provide more information?

deployment-videoscaler01 is one of the two servers experimentally using stretch, it's not comparable to what we use on the video scalers in production.

Gilles added a subscriber: Gilles.

Videoscalers don't run Thumbor

Ah! Yes, that all makes sense now, thanks!

We have a patched nutcracker version in jessie-wikimedia, which we haven't ported to stretch-wikimedia. Debian stretch has a new enough nutcracker version (0.4.1) and ships a systemd unit now too, but not a tmpfiles.d, because the upstream package doesn't really use /run/nutcracker at all (we only use it for the sockets, set in nutcracker's config).

We should probably leave the stretch package (0.4.1+dfsg-1) and possibly even backport that to jessie-wikimedia. The tmpfiles.d stanza should be moved to puppet instead, as it's referring to site-local configuration.
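For illustration, a site-local tmpfiles.d entry could look roughly like the sketch below. This is not the content of the actual puppet change. The function name `write_nutcracker_tmpfiles`, the 0755 mode, and the /etc/tmpfiles.d location are assumptions; the nutcracker owner and group follow the manual workarounds logged in this task.

```shell
#!/bin/sh
# Hypothetical helper: write a tmpfiles.d entry that recreates
# /run/nutcracker at boot. Fields: type path mode user group age
write_nutcracker_tmpfiles() {
    dest_dir="${1:-/etc/tmpfiles.d}"
    # "d" creates the directory if it is missing; "-" disables age-based cleanup.
    printf 'd /run/nutcracker 0755 nutcracker nutcracker -\n' \
        > "$dest_dir/nutcracker.conf"
}

# As root on a real host (not executed here):
#   write_nutcracker_tmpfiles
#   systemd-tmpfiles --create /etc/tmpfiles.d/nutcracker.conf
```

Running `systemd-tmpfiles --create` applies the entry immediately, so no reboot should be needed to get the directory back.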

Change 384980 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Create /run/nutcracker on stretch onwards

https://gerrit.wikimedia.org/r/384980

@Gilles sorry for the spam

I haven't investigated much besides the few comments above. I learned today about tmpfiles.d and systemd-tmpfiles. That seems like a good way to fix it for us :]

Change 384980 merged by Muehlenhoff:
[operations/puppet@production] Create /run/nutcracker on stretch onwards

https://gerrit.wikimedia.org/r/384980

hashar closed this task as Resolved.

I have rebooted deployment-videoscaler01

$ sudo systemctl --failed --all
0 loaded units listed.

Nutcracker is running just fine.

Danke!

Wouldn’t it be cleaner to have a RuntimeDirectory=nutcracker directive in nutcracker.service instead of a separate tmpfiles config? (RuntimeDirectory= was added in systemd v211, so it should be available even on Jessie if I’m not mistaken.)
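A minimal sketch of that alternative, written as a drop-in rather than an edit to the packaged unit. The helper name `write_runtime_dir_dropin` and the drop-in file name `runtime-dir.conf` are made up for illustration. Two caveats: systemd creates the directory with the ownership of the unit's User= and Group=, so this only helps if the unit runs as the nutcracker user (not verified here), and systemd removes the directory again when the service stops.

```shell
#!/bin/sh
# Hypothetical helper: add a drop-in so systemd creates /run/nutcracker
# each time nutcracker.service starts.
write_runtime_dir_dropin() {
    unit_dir="${1:-/etc/systemd/system}/nutcracker.service.d"
    mkdir -p "$unit_dir"
    # RuntimeDirectory= is relative to /run, so this yields /run/nutcracker.
    printf '[Service]\nRuntimeDirectory=nutcracker\n' \
        > "$unit_dir/runtime-dir.conf"
}

# As root on a real host (not executed here):
#   write_runtime_dir_dropin
#   systemctl daemon-reload && systemctl restart nutcracker
```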

Mentioned in SAL (#wikimedia-releng) [2018-04-18T22:53:15Z] <eddiegp> root@deployment-jobrunner03:/var/run# mkdir nutcracker && chown nutcracker.nutcracker nutcracker ref T178457

Saw the same on deployment-jobrunner03 today and had to create the directory manually there to get nutcracker to start. (Fun note: I found this task because Google listed it when I searched for what 'run, rabbit run' refers to.)