nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures)
Closed, Resolved, Public

Description

Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY

deployment-videoscaler01:~# systemctl list-units --state=failed
  UNIT               LOAD   ACTIVE SUB    DESCRIPTION                             
● ferm.service       loaded failed failed ferm firewall configuration             
● nutcracker.service loaded failed failed nutcracker proxy for memcached and Redis
● puppet.service     loaded failed failed Puppet agent                            
● rc-local.service   loaded failed failed /etc/rc.local Compatibility

Event Timeline

The ferm failure is a separate issue caused by AAAA DNS resolution, which is not available on labs. That is T176314, worked around via https://gerrit.wikimedia.org/r/#/c/381073/

/var/log/nutcracker/nutcracker.log
[2017-10-18 08:28:35.263] nc.c:184 nutcracker-0.4.1 built for Linux 4.9.0-3-amd64 x86_64 started on pid 4586
[2017-10-18 08:28:35.264] nc.c:189 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / don't sit down / it's time to dig another one
[2017-10-18 08:28:35.272] nc_proxy.c:148 bind on p 11 to addr '/var/run/nutcracker/nutcracker.sock 0666' failed: No such file or directory
[2017-10-18 08:28:35.272] nc.c:195 done, rabbit done

/var/run/nutcracker/nutcracker.sock 0666 failed: No such file or directory

/etc/nutcracker/nutcracker.yml
mc-unix:
  listen: /var/run/nutcracker/nutcracker.sock 0666
redis_eqiad:
  listen: /var/run/nutcracker/redis_eqiad.sock 0666

I guess we have nutcracker configured to listen on sockets under that path, but the systemd unit does not create the /var/run/nutcracker directory before starting the service.
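To confirm that diagnosis on another host, a quick check along these lines might help. It is only a sketch: the function name `missing_socket_dirs` is made up for illustration, and it assumes unix-socket entries in the config look like the `listen: /path/to.sock 0666` lines quoted above.

```shell
#!/bin/sh
# Print every directory that a unix-socket "listen:" entry needs
# but that does not exist yet.
# Assumption: socket entries have the form "listen: /path/to.sock 0666".
missing_socket_dirs() {
    config=$1
    # Keep only "listen:" lines whose value is an absolute path,
    # which skips TCP listeners such as "listen: 127.0.0.1:11212".
    awk '$1 == "listen:" && $2 ~ /^\// { print $2 }' "$config" |
    while IFS= read -r sock; do
        dir=$(dirname "$sock")
        [ -d "$dir" ] || printf '%s\n' "$dir"
    done | sort -u
}
```

Running `missing_socket_dirs /etc/nutcracker/nutcracker.yml` prints nothing when every socket directory is present, and one line per missing directory otherwise.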

Mentioned in SAL (#wikimedia-releng) [2017-10-18T08:38:20Z] <hashar> deployment-videoscaler01: install --owner=nutcracker -d /var/run/nutcracker && systemctl start nutcracker # T178457

hashar renamed this task from deployment-videoscaler01 has memcached failures to nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures). Oct 18 2017, 8:39 AM
hashar added a project: SRE.

Fixed it by MANUALLY creating a /var/run/nutcracker directory.

Mentioned in SAL (#wikimedia-releng) [2017-10-18T08:41:22Z] <hashar> deployment-mediawiki07: install --owner=nutcracker -d /var/run/nutcracker && systemctl start nutcracker # T178457

nutcracker ships /usr/lib/tmpfiles.d/nutcracker.conf, which should be creating the directory in (/var)/run. This has been working fine in production for months now. Not sure why it doesn't work in your case; could you troubleshoot a little more and provide more information?

deployment-videoscaler01 is one of the two servers experimentally using stretch, it's not comparable to what we use on the video scalers in production.

Gilles subscribed.

Videoscalers don't run Thumbor

Ah! Yes, that all makes sense now, thanks!

We have a patched nutcracker version in jessie-wikimedia, which we haven't ported to stretch-wikimedia. Debian stretch has a new enough nutcracker version (0.4.1) and ships a systemd unit now too, but not a tmpfiles.d, because the upstream package doesn't really use /run/nutcracker at all (we only use it for the sockets, set in nutcracker's config).

We should probably leave the stretch package (0.4.1+dfsg-1) as is and possibly even backport it to jessie-wikimedia. The tmpfiles.d stanza should be moved to puppet instead, as it refers to site-local configuration.
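For reference, a tmpfiles.d stanza for this could look roughly like the output of the sketch below. /run is a tmpfs, so a manually created directory is lost on reboot, while a tmpfiles.d entry is re-applied at every boot. The helper name `tmpfiles_dir_entry`, the 0755 mode, and the nutcracker user/group defaults are assumptions for illustration, not what the jessie-wikimedia package or the puppet patch actually contains.

```shell
#!/bin/sh
# Emit a tmpfiles.d(5) line that creates a directory at boot.
# Fields: type, path, mode, user, group, age ("-" disables age-based cleanup).
# Assumption: mode 0755 and user/group "nutcracker" are illustrative defaults.
tmpfiles_dir_entry() {
    path=$1
    mode=${2:-0755}
    owner=${3:-nutcracker}
    group=${4:-nutcracker}
    printf 'd %s %s %s %s -\n' "$path" "$mode" "$owner" "$group"
}
```

On a live host this would be used as `tmpfiles_dir_entry /run/nutcracker > /etc/tmpfiles.d/nutcracker.conf`, followed by `systemd-tmpfiles --create /etc/tmpfiles.d/nutcracker.conf` to apply it without rebooting.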

Change 384980 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Create /run/nutcracker on stretch onwards

https://gerrit.wikimedia.org/r/384980

@Gilles sorry for the spam

I haven't investigated much besides the few comments above. I learned today about tmpfiles.d and systemd-tmpfiles. That seems like a good way to fix it for us :]

Change 384980 merged by Muehlenhoff:
[operations/puppet@production] Create /run/nutcracker on stretch onwards

https://gerrit.wikimedia.org/r/384980

hashar assigned this task to MoritzMuehlenhoff.

I have rebooted deployment-videoscaler01

$ sudo systemctl --failed --all
0 loaded units listed.

Nutcracker is running just fine.

Thanks!

Wouldn’t it be cleaner to have a RuntimeDirectory=nutcracker directive in nutcracker.service instead of a separate tmpfiles config? (RuntimeDirectory= was added in systemd v211, so it should be available even on Jessie if I’m not mistaken.)
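A minimal sketch of that alternative, written as a drop-in rather than a change to the packaged unit file. The function name `write_runtime_dir_dropin` and the drop-in file name `runtime-directory.conf` are made up for illustration. Two caveats: systemd creates the directory owned by the unit's `User=`, so if the unit starts as root and nutcracker drops privileges itself, the directory would end up root-owned (whether the stretch unit sets `User=` is not established in this task); and the directory is removed again when the service stops.

```shell
#!/bin/sh
# Write a systemd drop-in that makes systemd create /run/<name>
# before the service starts.
# Assumption: the drop-in file name is an arbitrary illustrative choice.
write_runtime_dir_dropin() {
    root=$1   # filesystem root: "" on a live host, a scratch dir when testing
    unit=$2   # unit name, e.g. nutcracker.service
    name=$3   # directory name under /run, e.g. nutcracker
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir" &&
    printf '[Service]\nRuntimeDirectory=%s\n' "$name" > "$dir/runtime-directory.conf"
}
```

On a live host: `write_runtime_dir_dropin "" nutcracker.service nutcracker`, then `systemctl daemon-reload` and `systemctl restart nutcracker`.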

Mentioned in SAL (#wikimedia-releng) [2018-04-18T22:53:15Z] <eddiegp> root@deployment-jobrunner03:/var/run# mkdir nutcracker && chown nutcracker.nutcracker nutcracker ref T178457

Saw the same on deployment-jobrunner03 today (fun note: found this task because Google listed it when I searched for what 'run, rabbit run' refers to) and had to create the directory manually there to get nutcracker to start.