evaluate Cassandra-related impact of June 30 leap second event
Closed, ResolvedPublic

Description

A leap second event will occur on June 30th, 2015 at 23:59:60 UTC. The last leap second, in 2012, wreaked havoc on some Cassandra clusters, mostly as a result of a livelock in the Linux kernel. Evaluate what threat (if any) remains for the upcoming leap second, and plan accordingly.

See also: T103479

Event Timeline

Eevans claimed this task.
Eevans raised the priority of this task from to Needs Triage.
Eevans updated the task description.
Eevans added a project: RESTBase-Cassandra.
Eevans subscribed.

Most of the problems from 2012 were caused by the livelock bug in pre-3.4 Linux kernels. Java applications were hit particularly hard, though, because the JVM relied on the wall clock for thread parking. Neither should be a problem in our environment, however, as we are running kernels newer than 3.4 and JVMs newer than 7u60 (when it was patched to use an elapsed timer). See also: T103479.

However, Cassandra itself does depend on the wall clock: it assumes that timestamps are monotonically increasing and that the nodes of a Cassandra cluster are closely time-synchronized, so this does create some possibility of inconsistency.
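To make the timestamp concern concrete, here is a minimal sketch of last-write-wins resolution when the wall clock repeats a second. It is not Cassandra code: `Cell`, `resolve`, and `wall_clock_us` are hypothetical names, the clock is modelled as a single one-second step back, and tie-breaking on equal timestamps is not modelled faithfully.

```python
from dataclasses import dataclass

STEP_US = 1_000_000  # one second, in microseconds


@dataclass
class Cell:
    value: str
    timestamp_us: int  # write timestamp, microseconds (hypothetical model)


def resolve(a: Cell, b: Cell) -> Cell:
    """Last-write-wins: keep the cell with the higher timestamp.

    Ties simply return b here; real tie-breaking is not modelled.
    """
    return a if a.timestamp_us > b.timestamp_us else b


def wall_clock_us(elapsed_us: int, step_at_us: int) -> int:
    """Simplified wall clock that steps back one second at step_at_us."""
    if elapsed_us < step_at_us:
        return elapsed_us
    return elapsed_us - STEP_US


step_at = 10 * STEP_US
# Two writes 200 ms apart in real time, straddling the step.
first = Cell("old", wall_clock_us(step_at - 100_000, step_at))
second = Cell("new", wall_clock_us(step_at + 100_000, step_at))

winner = resolve(first, second)
print(winner.value)  # prints "old": the later write carries the lower timestamp
```

In this model the write that happened second loses, which is the kind of sub-second inconsistency a repeated second could produce.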

My understanding is that The Plan is to disable NTP on the 29th, and re-enable it on the 1st. The idea (I think) being that the kernel would never receive the leap second adjustment, that the drift experienced during the NTP outage is bounded and minimal, and that it will be slowly and gradually corrected after NTP is restored (as opposed to having an entire second wholesale repeated).

Questions I still have:

  • When exactly on the 29th is NTP going down, and when exactly is it coming up? Will it be down 2 full days, or 24 hours and some minutes?
  • I've observed some pretty extreme cases of drift; do we have any experience with this on our hardware? Can we put a better number on the expected drift?
  • Is it too late to consider a leap smear?
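For the first two questions, a rough back-of-the-envelope sketch of how the outage length and the drift rate combine. The drift rates below are purely hypothetical placeholders, not measurements from our hardware, and `accumulated_drift_ms` is a name invented for illustration.

```python
def accumulated_drift_ms(drift_ppm: float, outage_hours: float) -> float:
    """Offset in milliseconds accumulated by a free-running clock.

    drift_ppm is the clock's frequency error in parts per million
    (an assumed value, not a measured one).
    """
    outage_seconds = outage_hours * 3600
    return drift_ppm * 1e-6 * outage_seconds * 1000


# Hypothetical drift rates, compared over a 24 h and a 48 h NTP outage.
for ppm in (1, 5, 20):
    for hours in (24, 48):
        drift = accumulated_drift_ms(ppm, hours)
        print(f"{ppm:>3} ppm over {hours} h: {drift:8.1f} ms")
```

Note that two nodes drifting in opposite directions could end up apart by as much as twice the per-node figure.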

Incurring the leap second event as-is runs some risk of triggering an unknown bug, and creates a single 1-second window where sub-second concurrency could result in an inconsistency. Is that more or less risk than, say, a hypothetical 24-hour window where nodes are hundreds of milliseconds apart?

(For-the-record: I think the risks associated with either are quite low, but in the interest of being thorough...)

My understanding from the recent mail thread and a discussion with Moritz this morning is that the answer to your third question is no, and that they are considering using ntpd -x to smear the leap. I asked them to apply that to all Cassandra nodes at once, so that the time is adjusted in a coordinated manner.
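As a rough idea of how long such a slew would take, here is a small sketch. It assumes the one-second offset is corrected entirely by slewing at a capped rate of 500 ppm (the figure commonly cited for ntpd and the Linux kernel); how ntpd -x actually handles the leap depends on the version in use, so treat both the rate and the behaviour as assumptions.

```python
MAX_SLEW_PPM = 500  # assumption: commonly cited maximum slew rate


def slew_duration_seconds(offset_seconds: float,
                          slew_ppm: float = MAX_SLEW_PPM) -> float:
    """Time needed to absorb offset_seconds when slewing at slew_ppm."""
    return offset_seconds / (slew_ppm * 1e-6)


duration = slew_duration_seconds(1.0)
print(f"Slewing 1 s at {MAX_SLEW_PPM} ppm takes about "
      f"{duration:.0f} s ({duration / 60:.0f} min)")
```

Under these assumptions the adjustment would take on the order of half an hour per node, which is why applying it to all nodes at once matters: nodes that start slewing at different times would disagree by a fraction of a second in the interim.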

Eevans set Security to None.