
Split-brain strategy for services that use config managed by etcd
Open, Low, Public

Description

Scripts that use config which switches based on which datacenter is authoritative* need to handle the case where they end up on the minority side of an etcd split-brain, causing their config to go stale.

If a script, swiftrepl for example, were running for minutes or hours on stale config, it might be syncing containers in the wrong direction. It might see new files that exist only in the "destination" as needing to be deleted, when in fact they should be *created* in the "source", because source/destination are reversed due to the stale config. This would cause data loss. To some extent this is a problem with any config change when you have long-running scripts; even for short scripts, it's still a problem in the split-brain case. We need to make sure scripts/services act on up-to-date config (within X seconds) or stop running, in order to avoid corruption. In the worst split-brain, ops admins can't even stop the scripts or change the config for servers on the minority side, so the scripts need to have handled this case already (by stopping themselves). If X is high, we'd want to be careful to wait it out before a DC switchover.

Assuming etcd runs over the WAN in three datacenters, so that losing any single datacenter cannot break quorum, there are a few strategies.

[method I] One strategy (sketched below, after this list) is to make sure:
a) The client using etcd for config uses quorum reads on startup (so they are consistent or fail)
b) The client, if long-running, periodically rechecks the config in the same way
c) The client aborts when the above fail (rather than catching errors or falling back to process-cached config)
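A minimal sketch of method I, assuming the python-etcd client (whose read() accepts a quorum flag); the config key, the recheck interval, and do_work() are made up for illustration:

  # Method I sketch: quorum reads, periodic recheck, abort on failure.
  # Assumes python-etcd; /conf/dc-master and do_work() are hypothetical.
  import sys
  import time
  import etcd

  RECHECK_INTERVAL = 10  # seconds between config rechecks

  client = etcd.Client(host='localhost', port=2379)

  def read_config():
      # Quorum reads go through the leader, so they either return a
      # consistent value or raise when we're on the minority side.
      return client.read('/conf/dc-master', quorum=True).value

  try:
      config = read_config()  # (a) consistent read on startup, or fail
  except etcd.EtcdException:
      sys.exit('cannot get a consistent config read; refusing to start')

  last_check = time.time()
  while True:
      do_work(config)  # hypothetical unit of work
      if time.time() - last_check > RECHECK_INTERVAL:
          try:
              config = read_config()  # (b) periodic recheck
              last_check = time.time()
          except etcd.EtcdException:
              # (c) abort rather than continuing on process-cached config
              sys.exit('lost consistent view of the config; aborting')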

[method II] Use heartbeats into the etcd dataspace and non-quorum reads (sketched below, after this list). Clients would abort if the config is too stale (older than X seconds). This could lower read latency, since non-quorum reads are served locally.
a) The client using etcd for config checks it on startup and aborts if stale
b) The client, if long-running, periodically rechecks the config in the same way
c) The client aborts when the above fail (rather than catching errors or falling back to process-cached config)
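A minimal sketch of method II, again assuming python-etcd; the heartbeat key, X, and the process on the majority side that writes the heartbeat are made up for illustration (and this relies on reasonably synchronized clocks):

  # Method II sketch: non-quorum (local) reads plus a heartbeat key written
  # every few seconds by a process known to reach the etcd majority.
  # /conf/heartbeat, /conf/dc-master, X and do_work() are hypothetical.
  import sys
  import time
  import etcd

  X = 30  # maximum tolerated config staleness, in seconds

  client = etcd.Client(host='localhost', port=2379)

  def read_config_or_abort():
      # Non-quorum reads are served by the local member and can be stale on
      # the minority side of a split-brain; the heartbeat key bounds how stale.
      heartbeat = float(client.read('/conf/heartbeat').value)
      if time.time() - heartbeat > X:
          sys.exit('config heartbeat is older than %ds; aborting' % X)
      return client.read('/conf/dc-master').value

  config = read_config_or_abort()      # (a) check on startup, abort if stale
  while True:
      do_work(config)                  # hypothetical unit of work
      time.sleep(5)
      config = read_config_or_abort()  # (b)/(c) periodic recheck, abort if stale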

Daemons, of course, would need to restart once the etcd problems resolve.

The most error-prone part seems to be the periodic checks for long-running scripts (e.g. swiftrepl). It seems easy to fail to account for how long certain parts of a script might take. Maybe hacking up a monitor daemon that kills etcd-dependent scripts when their config is stale would be less error-prone, though that raises the questions of what kill level to use (what if SIGINT isn't enough?) and what might break with higher levels (e.g. kill -9).
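A rough sketch of such a monitor, with hypothetical paths and timings, and the signal escalation left visible in the code:

  # Monitor sketch: kill an etcd-dependent script whose config heartbeat file
  # has gone stale, escalating from SIGINT to SIGKILL if it won't die.
  # The file paths and the staleness limit are hypothetical.
  import os
  import signal
  import time

  STALENESS_LIMIT = 30  # seconds; the "X" from the description
  HEARTBEAT_FILE = '/run/swiftrepl/config-heartbeat'
  PID_FILE = '/run/swiftrepl/swiftrepl.pid'

  def config_is_stale():
      try:
          return time.time() - os.path.getmtime(HEARTBEAT_FILE) > STALENESS_LIMIT
      except OSError:
          return True  # a missing heartbeat file counts as stale

  while True:
      if config_is_stale():
          pid = int(open(PID_FILE).read().strip())
          os.kill(pid, signal.SIGINT)       # polite first
          time.sleep(10)
          try:
              os.kill(pid, 0)               # still alive?
              os.kill(pid, signal.SIGKILL)  # ...then escalate
          except OSError:
              pass                          # already exited
      time.sleep(5)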

*"authoritative" could mean "handles writes" or where things like local Cassandra quorums are directed, or where reads-for-writes go.

See also the following tasks, which are about adopting etcd in the current active-inactive model:

Event Timeline

aaron raised the priority of this task to Medium.
aaron updated the task description.

Side note: I know this is not in the scope of this ticket, but I can very much relate, given db-eqiad.php and db-codfw.php and long-running maintenance scripts using stale config.

aaron set Security to None.

> The most error-prone part seems to be periodic checks for long-running scripts (e.g. swiftrepl). It seems easy to fail to account for how long certain parts of a script might take.

systemd has a heartbeat-based service monitoring facility (the watchdog). When enabled, the service has to periodically send a datagram containing WATCHDOG=1 to the Unix-domain socket named by the $NOTIFY_SOCKET variable in its environment. systemd can intervene and restart the service if too much time has passed since the last heartbeat. We can send a heartbeat every time we successfully retrieve configuration data from etcd.
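A minimal sketch of what the service side could look like, assuming the unit sets WatchdogSec=; fetch_config() and do_work() are hypothetical:

  # Send a systemd watchdog heartbeat only after a *successful* config read,
  # so stale or broken etcd access lets systemd restart (or stop) the service.
  import os
  import socket
  import time

  def notify_watchdog():
      addr = os.environ.get('NOTIFY_SOCKET')
      if not addr:
          return
      if addr.startswith('@'):
          addr = '\0' + addr[1:]  # abstract-namespace socket
      sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
      try:
          sock.sendto(b'WATCHDOG=1', addr)
      finally:
          sock.close()

  while True:
      config = fetch_config()  # hypothetical: raises if etcd is unreachable/stale
      notify_watchdog()        # heartbeat only after a successful read
      do_work(config)          # hypothetical unit of work
      time.sleep(5)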

Adapting this approach for Trusty would entail either some custom work if we wanted the behavior to match systemd's, or we could do something closely analogous with existing tools. One possibility would be to use monit. Instead of sending a heartbeat datagram, the process could touch a PID file (sketched after the monit config below), and we'd configure monit to check the file's mtime:

check file foo_pid with path /run/foo.pid
  if timestamp > 10 seconds then exec "/usr/sbin/service foo restart"

See [[ https://mmonit.com/monit/documentation/monit.html#TIMESTAMP-TESTING | TIMESTAMP TESTING ]] in the monit manual.
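The process side could be as simple as bumping the watched file's mtime only after a successful config read; fetch_config() and do_work() are hypothetical:

  # Touch the file monit watches only after a successful config read from etcd,
  # so a failed or stale read lets the mtime age past 10 seconds and monit acts.
  import os
  import time

  WATCHED_FILE = '/run/foo.pid'  # the path from the monit stanza above

  while True:
      config = fetch_config()        # hypothetical: raises on failure
      os.utime(WATCHED_FILE, None)   # bump mtime to "now" only on success
      do_work(config)                # hypothetical unit of work
      time.sleep(5)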
ori lowered the priority of this task from Medium to Low. Nov 30 2015, 7:43 PM