Nutcracker doesn't start at boot
Closed, Resolved (Public)

Description

Noticed this today on mw2017: nutcracker is started by the first puppet run after boot, but is not started automatically at boot.

mw2017:~$ ps fwuax | grep -i nutcracker
filippo   3516  0.0  0.0  12728  2172 pts/0    S+   15:37   0:00              \_ grep -i nutcracker
nutcrac+  2085  0.0  0.0  20760  2696 ?        Ssl  15:37   0:00 /usr/sbin/nutcracker --verbose=4 --mbuf-size=65536
mw2017:~$ uptime
 15:37:47 up 2 min,  1 user,  load average: 0.85, 0.72, 0.30
mw2017:~$ date
Tue Apr 25 15:37:49 UTC 2017
mw2017:~$ grep nutcracker /var/log/puppet.log
grep: /var/log/puppet.log: Permission denied
mw2017:~$ sudo !!
sudo grep nutcracker /var/log/puppet.log
Notice: /Stage[main]/Nutcracker/Service[nutcracker]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Nutcracker/Service[nutcracker]: Unscheduling refresh on Service[nutcracker]

Event Timeline

That's somewhat expected, since nutcracker is currently not enabled for automatic service startup:

jmm@mw1293:~$ sudo systemctl is-enabled nutcracker
disabled
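
To spot this state on other hosts, a minimal sketch along these lines might help. It only combines the output of "systemctl is-enabled" and "systemctl is-active"; the function name boot_state and the SYSTEMCTL override variable are made up for illustration and are not part of any existing tooling.

```shell
#!/bin/sh
# Hypothetical helper: report whether a unit is running without being
# enabled for boot. SYSTEMCTL can be overridden, e.g. for dry runs.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

boot_state() {
    unit="$1"
    # is-enabled exits non-zero for "disabled", so ignore the exit status
    # and only look at the printed state.
    enabled=$("$SYSTEMCTL" is-enabled "$unit" 2>/dev/null) || true
    active=$("$SYSTEMCTL" is-active "$unit" 2>/dev/null) || true
    case "$enabled/$active" in
        enabled/active)  echo "ok" ;;
        disabled/active) echo "running-but-not-enabled" ;;
        *)               echo "other: enabled=$enabled active=$active" ;;
    esac
}
```

Calling "boot_state nutcracker" on a host in the state described here should print "running-but-not-enabled", since puppet has started the service but nothing has enabled it.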

And the puppet module then starts it (from modules/nutcracker/manifests/init.pp):

service { 'nutcracker':
    ensure => ensure_service($ensure),
}

I'm not sure whether that's intended; it might simply be a side effect of the trusty -> jessie migration. Fortunately I've forgotten most of what I ever knew about Upstart, but nutcracker presumably started automatically under trusty? Adding @Joe and @ori, who probably know the history best.

Converting nutcracker to base::service_unit is likely the best fix here.

Also, since HHVM in our current config unconditionally tries to connect to unix::/var/run/nutcracker/redis_$DC.sock, we could also add an "After=nutcracker.service" to HHVM's unit.
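
One way to add that ordering without editing the packaged unit would be a systemd drop-in. The following is only a sketch: the unit name hhvm.service, the drop-in file name nutcracker-ordering.conf and the helper write_after_dropin are assumptions for illustration, not what our puppet code actually does.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that orders a unit after
# nutcracker.service.
#   $1: systemd config root (normally /etc/systemd/system)
#   $2: unit to order (assumed here: hhvm.service)
write_after_dropin() {
    root="$1"
    unit="$2"
    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    # After= only affects start ordering; it does not pull nutcracker in.
    cat > "$dir/nutcracker-ordering.conf" <<'EOF'
[Unit]
After=nutcracker.service
EOF
}
```

Usage would be something like "write_after_dropin /etc/systemd/system hhvm.service" followed by "systemctl daemon-reload". Note that After= is ordering only, so nutcracker still has to be enabled separately for this to help at boot.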

Sounds like the right thing to do

I dug a little deeper and this turned out to be a subtle packaging bug / debhelper oddity:

The current debian/rules file uses:

dh $@  --with autoreconf --with-systemd

Looks totally benign, right? Here's the postinst which gets generated by dh from that:

# Automatically added by dh_installinit
# In case this system is running systemd, we need to ensure that all
# necessary tmpfiles (if any) are created before starting.
if [ -d /run/systemd/system ] ; then
        systemd-tmpfiles --create /usr/lib/tmpfiles.d/nutcracker.conf >/dev/null || true
fi
# End automatically added section
# Automatically added by dh_installinit
if [ -x "/etc/init.d/nutcracker" ]; then
        update-rc.d nutcracker defaults >/dev/null
        invoke-rc.d nutcracker start || exit $?
fi
# End automatically added section

But if we change the dh call to this:

dh $@  --with autoreconf,systemd

Then after a rebuild, debhelper inserts the correct code to activate and start nutcracker:

# Automatically added by dh_systemd_enable
# This will only remove masks created by d-s-h on package removal.
deb-systemd-helper unmask nutcracker.service >/dev/null || true

# was-enabled defaults to true, so new installations run enable.
if deb-systemd-helper --quiet was-enabled nutcracker.service; then
        # Enables the unit on first installation, creates new
        # symlinks on upgrades if the unit file has changed.
        deb-systemd-helper enable nutcracker.service >/dev/null || true
else
        # Update the statefile to add new symlinks (if any), which need to be
        # cleaned up on purge. Also remove old symlinks.
        deb-systemd-helper update-state nutcracker.service >/dev/null || true
fi
# End automatically added section
# Automatically added by dh_installinit
# In case this system is running systemd, we need to ensure that all
# necessary tmpfiles (if any) are created before starting.
if [ -d /run/systemd/system ] ; then
        systemd-tmpfiles --create /usr/lib/tmpfiles.d/nutcracker.conf >/dev/null || true
fi
# End automatically added section
# Automatically added by dh_installinit
if [ -x "/etc/init.d/nutcracker" ]; then
        update-rc.d nutcracker defaults >/dev/null
        invoke-rc.d nutcracker start || exit $?
fi
# End automatically added section
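
The difference between the two generated scripts is the dh_systemd_enable section, so a quick check for a broken build could look for that marker comment. This is a rough sketch; the function name has_systemd_enable is made up, and it only greps for the marker line shown above.

```shell
#!/bin/sh
# Hypothetical check: succeed if a postinst script contains the section
# generated by dh_systemd_enable, fail otherwise.
#   $1: path to the postinst script
has_systemd_enable() {
    grep -q '^# Automatically added by dh_systemd_enable' "$1"
}
```

On an installed system the script lives under /var/lib/dpkg/info, so "has_systemd_enable /var/lib/dpkg/info/nutcracker.postinst && echo fixed" would be one way to tell the two builds apart.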

I'll upload a fixed package to jessie-wikimedia after some more tests in codfw.

The new package fixes that (tested by rebooting two servers, one without and one with the new package):

root@mw2222:~# dpkg -l nutcracker
ii nutcracker 0.4.1-1+wm3~jessie0 amd64 Fast, light-weight proxy for memcached and Redis
root@mw2222:~# grep "Failed connecting to redis server" /var/log/hhvm/error.log | wc -l
26

root@mw2220:~# dpkg -l nutcracker
ii nutcracker 0.4.1-1+wm3~jessie1 amd64 Fast, light-weight proxy for memcached and Redis
root@mw2220:~# grep "Failed connecting to redis server" /var/log/hhvm/error.log | wc -l
0

Mentioned in SAL (#wikimedia-operations) [2017-05-19T11:05:09Z] <moritzm> uploaded nutcracker 0.4.1-1+wm3~jessie1 to apt.wikimedia.org (T163795)

stretch now has 0.4.1 (prepared/maintained by yours truly), and I just checked: it doesn't suffer from this bug. The right way forward would be for us to switch to that or, if we have local patches, to rebase them on top of stretch's package rather than keep stacking patches on top of our fork.

I've filed T166038 for rebasing to the stretch 0.4.1 package. I'll proceed with rolling out the current isolated fix in the meantime, since with the current behaviour nutcracker trips Icinga alerts if more than one mw* server is rebooted at a time.

Mentioned in SAL (#wikimedia-operations) [2017-05-23T14:03:50Z] <moritzm> installing nutcracker update in codfw (T163795)

Fixed package is fully rolled out now.