Page MenuHomePhabricator

Better handling of memcached service
Closed, ResolvedPublic

Description

When deploying memcached on the IDP test instances I noticed that we puppetise the systemd service. But leads to a number of potential breakages during updates, if a new memcached package is deployed, the puppetised version gets overwritten until the next Puppet run.

The memcached class already uses systemd::service, so we can deploy our specific ExecStart call via an override.

This would also allow us to benefit from changes to the memcached unit in Debian, the version present in the stock deb ships various hardening features which we currently miss.

Also, memcached in stretch and later ships a systemd-memcached-wrapper which simply points to /etc/memcached.conf, so it would also be an option to switch to that and possibly no longer customise the systemd unit at all.

Event Timeline

The current setup can even lead to startup failures, on one of the IDP hosts I ran into the issue that the Puppet run for the systemd unit raced with the service restart triggered by the package update:

jmm@idp-test2001:~$ sudo journalctl -u memcached
-- Logs begin at Wed 2020-06-10 22:38:36 UTC, end at Thu 2020-06-11 12:32:41 UTC. --
Jun 11 11:58:13 idp-test2001 memcached[28612]: Signal handled: Terminated.
Jun 11 11:58:13 idp-test2001 systemd[1]: Stopping memcached daemon...
Jun 11 11:58:13 idp-test2001 systemd[1]: memcached.service: Succeeded.
Jun 11 11:58:13 idp-test2001 systemd[1]: Stopped memcached daemon.
Jun 11 11:58:14 idp-test2001 systemd[1]: Started memcached daemon.
Jun 11 11:59:15 idp-test2001 systemd[1]: memcached.service: Current command vanished from the unit file, execution of the command list won't be resumed.
jbond triaged this task as Medium priority.Jun 11 2020, 1:05 PM

Change 605878 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] memcached: convert systemd service file to an override

https://gerrit.wikimedia.org/r/605878

Change 605878 merged by Jbond:
[operations/puppet@production] memcached: convert systemd service file to an override

https://gerrit.wikimedia.org/r/605878

@jbond @MoritzMuehlenhoff I can file a separate ticket for this, but I think this may be helpful to report here first.

I just got the same "error":

Nov 25 16:41:48 cumin2001 systemd[1]: regular_snapshot.service: Current command vanished from the unit file, execution of the command list won't be resumed.

This made the backup batch not run for codfw. Happily, we have backup monitoring to catch that. The same change was applied to eqiad (cumin1001 and cumin2001 have the same profile) but backups ran without an issue there. This worries me because I would expect to either the change working or producing a systemd daemon/service/timer error, not failing silently. Does the puppet timer code need some changes to prevent race conditions, or maybe documenting not to update timers, and disable and reenable them again? Or maybe the patch was not done correctly?

Please check cumin2001 logs, or I can create a separate task for this.

@jbond @MoritzMuehlenhoff I can file a separate ticket for this, but I think this may be helpful to report here first.

I just got the same "error":

Nov 25 16:41:48 cumin2001 systemd[1]: regular_snapshot.service: Current command vanished from the unit file, execution of the command list won't be resumed.

This made the backup batch not run for codfw. Happily, we have backup monitoring to catch that. The same change was applied to eqiad (cumin1001 and cumin2001 have the same profile) but backups ran without an issue there. This worries me because I would expect to either the change working or producing a systemd daemon/service/timer error, not failing silently. Does the puppet timer code need some changes to prevent race conditions, or maybe documenting not to update timers, and disable and reenable them again? Or maybe the patch was not done correctly?

The issue for the memcached service was different, but the root cause is likely the same:

For memcached we had a systemd service which was shipped in the Debian package (/lib/systemd/system/memcached.service) which was overwritten with settings needed by the IDP setup, which was also written to /lib/systemd/system/memcached.service. The core issue here is that after updating the memcached deb we again get the version shipped by Debian (until the next Puppet run kicks in). In addition we once ran into the race from https://phabricator.wikimedia.org/T255132#6214939 where Puppet run which wrote https://phabricator.wikimedia.org/T255132#6214939 occurred at the same time the daemon was reloaded (small race window, but can happen). For the service unit the correct fix was to move switch to systemd service overrides, so that we only apply the WMF-specific settings on top of what Debian ships as memcached.service.

For regular_snapshot.service commit d684269d81 changed the command configured for the timer, which also updated the systemd::unit defined in systemd::timer::job. But in this case we don't run systemctl daemon-reload, so the updated unit wasn't picked up, I think to properly fix this we do an exec for systemctl daemon-reload with refreshonly=>true.

Thanks.

we do an exec for systemctl daemon-reload with refreshonly=>true

Is this for the caller code or the systemd::timer ? I can help with this, and test it.

Thanks.

we do an exec for systemctl daemon-reload with refreshonly=>true

Is this for the caller code or the systemd::timer ? I can help with this, and test it.

Actually digging a little deeper; systemd::unit already runs a daemon-reload by default, so it must be something else, maybe we just hit a small/unfortunate time window while a systemd timer tried to execute the unit while it was reloaded or so?

maybe we just hit a small/unfortunate time window while a systemd timer tried to execute the unit while it was reloaded or so

Not the case- the timer didn't start on cumin2001 on the next iteration- it is dead, no execution since the 25th (no backups on codfw). I believe the code for the timer or the one calling it is not working as intended.

I did a disable and enable on the timer to see if that helps.

akosiaris added subscribers: Dzahn, akosiaris.

@Dzahn, Judging from the content of the task, this is for Infrastructure-Foundations, not serviceops, retagging.

That being said, last update on Nov 30 2020, more than 2.5 years ago. @jcrespo , @MoritzMuehlenhoff is there anything left to be done here. Should we resolve this?

@Dzahn, Judging from the content of the task, this is for Infrastructure-Foundations, not serviceops, retagging.

That being said, last update on Nov 30 2020, more than 2.5 years ago. @jcrespo , @MoritzMuehlenhoff is there anything left to be done here. Should we resolve this?

I think this can be closed, we fixed the gist of the issue with adding a systemd override so that only the WMF-specific options are applied on top of the systemd unit provided by Debian/upstream. Debian has a Debian-specific mechanism called the systemd-memcached-wrapper, which was mentioned in the task, but the override fixes this equally well for the IDP use case. I'm fine with resolving, but let's wait until @jcrespo is back.

ACK, thanks Alex, you are right. And Moritz, sounds good to me to close it then.

Just returned today. If Moritz's issue was fixed, not a blocker on my side, I had not had the issue I commented on myself recently.

jcrespo assigned this task to MoritzMuehlenhoff.