
Ensure kernel and OpenJDK fixes for leap second are present
Closed, Resolved (Public)

Description

The next leap second will occur on June 30th, 2015 at 23:59:60 UTC. The previous leap second, in 2012, exposed a livelock in the Linux kernel. It was fixed in Linux 3.4 and backported to older kernels in these two commits:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6b1859dba01c7
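
For context on why the 23:59:60 timestamp is awkward for software: POSIX time has no slot for it. The following is a minimal Python sketch, not taken from either fix, that only illustrates how the extra second collides with the following midnight. The variable names are illustrative.

```python
import calendar
from datetime import datetime

# datetime only accepts seconds in the range 0..59, so the leap second
# itself cannot be constructed as a datetime object.
try:
    datetime(2015, 6, 30, 23, 59, 60)
    leap_representable = True
except ValueError:
    leap_representable = False

# calendar.timegm() does plain arithmetic on the fields, so second 60
# lands on the same POSIX timestamp as the following midnight.
leap = calendar.timegm((2015, 6, 30, 23, 59, 60, 0, 0, 0))
midnight = calendar.timegm((2015, 7, 1, 0, 0, 0, 0, 0, 0))

print(leap_representable)
print(leap == midnight)
```

Because both instants map to the same timestamp, the system clock has to be stepped or slewed around the leap second, which is the code path the kernel fixes touch.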

We need to check that all precise hosts are running a kernel that includes these patches.

In addition, the Java developers changed their clock handling: https://bugs.openjdk.java.net/browse/JDK-6900441
The fixed OpenJDK packages should all have been rolled out as part of previous security updates, but that needs to be double-checked.

Java 6

The Java fix is present in all our openjdk-7 packages and on the few systems with an openjdk-8 backport. This covers the complex services like Hadoop, Cassandra and Elastic.

However, Oracle didn't backport the fix to openjdk-6. The following systems use openjdk-6. In the worst case we need to restart some services, but most of these should probably be moved to Java 7 anyway (e.g. lanthanum, since other parts of Jenkins already use Java 7).

  • labcontrol2001.wikimedia.org
  • ytterbium.wikimedia.org - T103668
  • labsdb1004.eqiad.wmnet
  • zirconium.wikimedia.org
  • labsdb1006.eqiad.wmnet
  • lanthanum.eqiad.wmnet - T103491

In addition, three further systems have openjdk-6 installed, but their default Java is version 7 (according to java -version). I'll clean these up:

  • nembus.wikimedia.org
  • neptunium.wikimedia.org
  • gallium.wikimedia.org - T103491
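
The java -version check mentioned above can be scripted. This is a minimal Python sketch, assuming the usual version banner format; the function name and the sample strings are hypothetical, not output captured from these hosts.

```python
import re


def java_major_version(version_output):
    """Return the Java major version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Releases before Java 9 report themselves as 1.x (1.6.0_35 is Java 6).
    if first == 1 and second is not None:
        return int(second)
    return first


# Illustrative banner lines, not real output from the hosts listed above.
print(java_major_version('java version "1.6.0_35"'))
print(java_major_version('java version "1.7.0_79"'))
```

Note that java -version writes its banner to stderr, so a wrapper that runs the command needs to capture stderr rather than stdout.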

Event Timeline

MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff raised the priority of this task from to Needs Triage.
MoritzMuehlenhoff updated the task description.
MoritzMuehlenhoff subscribed.

All our 3.2 kernels (and also sodium's 2.6.32) have the livelock fix (6b43ae8a619d17c4935c3320d2ef9e92bdeed05d). The system with the oldest kernel (es1006) was really close, though: its uptime is 1039 days, and Ubuntu pushed their fix two weeks before it was booted.

The followup fix (6b1859dba01c7) only seems to apply to 3.4 and wasn't backported to older kernels.

The Java fix is present in all our openjdk-7 packages and on the few systems with an openjdk-8 backport. This covers the complex services like Hadoop, Cassandra and Elastic.

However, Oracle didn't backport the fix to openjdk-6. The following systems use openjdk-6. In the worst case we need to restart some services, but most of these should probably be moved to Java 7 anyway (e.g. lanthanum, since other parts of Jenkins already use Java 7).
labcontrol2001.wikimedia.org
ytterbium.wikimedia.org
labsdb1004.eqiad.wmnet
zirconium.wikimedia.org
labsdb1006.eqiad.wmnet
lanthanum.eqiad.wmnet

In addition, three further systems have openjdk-6 installed, but their default Java is version 7 (according to java -version). I'll clean these up:
nembus.wikimedia.org
neptunium.wikimedia.org
gallium.wikimedia.org

hashar set Security to None.
hashar subscribed.

@MoritzMuehlenhoff I copied your last comment (T103479#1391521) into the task description and added a checkbox in front of each machine.

I guess you will want to file subtasks for each of them.

ytterbium.wikimedia.org is the Gerrit production host. Gerrit itself uses Java 7, so we can probably just purge Java 6: T103668: Remove Java 6 from ytterbium.wikimedia.org (Gerrit production host)

@MoritzMuehlenhoff: is the plan to smear the leap, or will we let it happen normally?

In principle it does make a difference to Cassandra as it's using timestamps for conflict resolution. However, the way we use it (data model and concurrency) makes it very unlikely that it would matter in practice.

The patches to smear the leap second were only merged into the Linux kernel last week.

We plan to disable NTP on the 29th, so the leap second won't be communicated to the systems. The individual hosts will run on their hardware clocks until the 1st, when we'll enable NTP again. A proposal is currently being worked out and will be sent around.
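
On the smearing question raised above: smearing means spreading the extra second over a longer window instead of inserting it all at once. Here is a minimal Python sketch of a linear smear. The 24-hour window and the function name are illustrative assumptions, not what the kernel patches implement.

```python
def smear_offset(seconds_into_window, window_seconds=86400.0):
    """Fraction of the extra second absorbed so far under a linear smear.

    Returns 0.0 before the window starts and 1.0 once it has ended.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    fraction = seconds_into_window / window_seconds
    # Clamp so that times outside the window map to "nothing applied yet"
    # or "fully applied".
    return min(1.0, max(0.0, fraction))


# Halfway through the assumed 24-hour window, half a second is absorbed.
print(smear_offset(43200))
```

With a smear, timestamps stay monotonic and no second is repeated, at the cost of the clock being off by up to one second during the window.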

@MoritzMuehlenhoff Danke Schoen! Thanks a ton for taking extra care :-}