Restart Analytics JVM daemons for open-jdk security updates
Closed, Resolved; Public; 13 Estimated Story Points

Description

Daemons to restart:

  • Kafka on kafka-jumbo
  • Kafka on kafka10[12-22] (Analytics) - stopped at kafka1018, hw issue - https://phabricator.wikimedia.org/T181518
  • Kafka on kafka100[123] (Main Eqiad)
  • Kafka on kafka200[123] (Main Codfw)
  • Cassandra on aqs100[4-9] (together with reboot for kernel updates)
  • Hadoop HDFS Journal/Data nodes and Yarn NodeManager on analytics10[28-69]
  • Hadoop HDFS Namenode and Yarn Resource Manager (plus other minor daemons) on analytics100[12]
  • Hive Server/Database and Oozie on analytics1003
  • Druid-* on druid100[1-3]
  • Druid-* on druid100[4-6]
  • Zookeeper on Druid*
  • Zookeeper on Conf*
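
The host lists above use a bracket shorthand. A minimal sketch of how it can be expanded, assuming that `[12-22]` is an inclusive numeric range, `[2,3]` is a comma-separated list, and a bare `[123]` is a set of single characters; `expand_hosts` is a hypothetical helper written for illustration, not a tool used in this task:

```python
import re


def expand_hosts(pattern):
    """Expand one bracket group, e.g. 'kafka10[12-22]' -> kafka1012 .. kafka1022.

    Assumed reading of the shorthand (not confirmed by the task itself):
      [12-22] inclusive numeric range, [2,3] comma list, [123] single characters.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: the pattern is already a single host name.
        return [pattern]
    prefix, body, suffix = match.groups()
    if "-" in body or "," in body:
        parts = []
        for item in body.split(","):
            if "-" in item:
                lo, hi = item.split("-")
                # Keep the zero padding of the lower bound.
                parts.extend(str(n).zfill(len(lo)) for n in range(int(lo), int(hi) + 1))
            else:
                parts.append(item)
    else:
        # A bare run of characters such as [123] is read one character at a time.
        parts = list(body)
    return [prefix + part + suffix for part in parts]
```

For example, `expand_hosts("kafka100[123]")` yields `kafka1001`, `kafka1002` and `kafka1003`, and `expand_hosts("kafka10[12-22]")` yields eleven brokers.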

Event Timeline

Note that the hadoop clusters and kafka* are running Java 7 and there hasn't been an openjdk-7 release yet (so also no update in Debian), so at this point only kafka-jumbo (which runs stretch/java8) and aqs/cassandra need an update.

Yep, I was planning to keep this open until the new updates arrive for jdk 7, but if that is too far in the future I'll open a new phab task later on and limit the scope of this one :)

Mentioned in SAL (#wikimedia-analytics) [2017-11-08T10:04:43Z] <elukey> suspended cassandra-coord-pageview-per-project-hourly as prep step to reboot aqs nodes - T179943

Currently waiting for the jdk7 updates for jessie.

elukey changed the task status from Open to Stalled. Nov 9 2017, 12:42 PM
elukey moved this task from Backlog to Stalled on the User-Elukey board.

Mentioned in SAL (#wikimedia-operations) [2017-11-28T14:03:11Z] <elukey> reboot kafka200[123] for kernel + jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-11-28T14:17:34Z] <elukey> reboot kafka10[12-22] for kernel + jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-11-29T11:30:10Z] <elukey> reboot kafka1001 for kernel + jvm updates - T179943

elukey changed the task status from Stalled to Open. Nov 29 2017, 12:05 PM
elukey updated the task description.

Mentioned in SAL (#wikimedia-operations) [2017-11-29T13:18:13Z] <elukey> reboot kafka100[23] for jvm+kernel updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-11-29T14:36:25Z] <elukey> reboot druid100[456] for jvm+kernel updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-11-30T16:12:17Z] <elukey> drain and reboot analytics1031->39 to pick up jvm+kernel updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-11-30T21:23:00Z] <mutante> powercycling kafka1018 (was down in Icinga and saw in SAL: reboot kafka10[12-22] for kernel + jvm updates - T179943)

Mentioned in SAL (#wikimedia-operations) [2017-12-01T08:40:57Z] <elukey> reboot the remaining analytics103* hadoop workers to pick up kernel+jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-01T09:23:46Z] <elukey> reboot analytics104* for kernel+jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-01T10:57:21Z] <elukey> reboot analytics1028 for kernel + jvm updates (Hadoop HDFS journalnode) - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-01T12:44:08Z] <elukey> reboot druid1001 for kernel+jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-04T09:24:26Z] <elukey> reboot analytics104* (hadoop worker nodes) for kernel+jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-04T09:24:26Z] <elukey> reboot analytics104* (hadoop worker nodes) for kernel+jvm updates - T179943

The SAL entry above (2017-12-04T09:24:26Z) should refer to analytics105*; analytics104* had already been done!

Mentioned in SAL (#wikimedia-operations) [2017-12-04T14:01:50Z] <elukey> reboot analytics106* (hadoop worker nodes) for kernel+jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-05T09:42:32Z] <elukey> reboot analytics100[12] for kernel+jvm updates (Hadoop Master nodes) - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-05T10:45:53Z] <elukey> reboot druid1003 for kernel+jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-07T10:12:25Z] <elukey> reboot analytics1003 for kernel+jvm updates - T179943

The last reboot (analytics1003) was particularly painful due to several issues happening in a row.

Timeline of events in UTC:

  • [10:12] Reboot of analytics1003 after draining all the Hadoop jobs.
  • [10:20] Long boot time caused by systemd-tmpfiles-setup.service, but the boot eventually succeeded (13 seconds kernel startup and 6:22 mins userspace). All daemons were up and running, and I didn't notice anything weird while checking the host.
  • [10:42] Alarms fired for Hive not available (both server and metastore). This was probably the Icinga downtime expiring. At the same time, every ssh session to analytics1003 was immediately terminated with "System is booting up. See pam_nologin(8)", so I was not able to check what was happening. I tried to connect to the physical serial console (remotely) but could not get in: Moritz was still holding a session, and I had assumed that the serial console would return an error message, rather than simply nothing, when already in use.
  • [10:50] analytics1003 powercycled, booted correctly (kernel time 14s, 55s user space, this time no systemd slowdowns).
  • [10:51] Hive not available as reported by Joseph: all clients were getting "connection refused" while trying to connect to port 10000 of the Hive server (the one to which queries are supposed to go). Netstat on analytics1003 showed that port 10000 was not bound by any process.
  • [11:50] Removed the new Prometheus javaagent configs set up for T177458 (already running on all the Java daemons for the Hadoop clusters without any issue) and Hive was able to bind its client port properly.
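
The 10:51 symptom (daemons up, logs clean, but "connection refused" on port 10000) can be checked from a client machine without logging in to the server. A minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name and the 2 second timeout is an arbitrary assumption:

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port is accepted.

    connect_ex returns 0 on success and an errno value otherwise
    (for example ECONNREFUSED when no process has bound the port).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0
```

Calling `port_is_open(hive_host, 10000)`, where `hive_host` stands for whatever name the Hive server resolves to in your environment, would have returned False during the outage even though the Hive processes themselves were running.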

Several things slowed me down (mostly my fault but reporting for completeness):

  • The Hive daemons (server and metastore) were up and not reporting weird messages in their logs. The Prometheus javaagents were successfully exposing metrics.
  • I tried for a bit to figure out where Hive port 10000 is set in puppet, without much success; I initially thought that it was a problem with my recent puppet refactoring.
  • systemctl restart (and even service restart/stop) was not able to kill the running daemons for some reason, which ended up in weird errors from trying to bind the same port multiple times.

Worth mentioning: I found this while investigating:

elukey@analytics1003:/var/log/mylvmbackup$ less analytics-meta.log
[...]
Can't locate File/Copy/Recursive.pm in @INC (you may need to install the File::Copy::Recursive module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.20.2 /usr/local/share/perl/5.20.2 /usr/lib/x86_64-linux-gnu/perl5/5.20 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.20 /usr/share/perl/5.20 /usr/local/lib/site_perl .) at /usr/bin/mylvmbackup line 25.
BEGIN failed--compilation aborted at /usr/bin/mylvmbackup line 25.
[..]
root@analytics1002:/srv/backup/mysql/analytics-meta/backup# ls -lht |head
total 16G
-rw-r----- 1 root root  50M Sep 22  2016 ib_logfile0
-rw-r----- 1 root root  48M Sep 22  2016 analytics-meta-bin.004648
-rw-r----- 1 root root  640 Sep 22  2016 analytics-meta-bin.004649
-rw-r----- 1 root root 4.7K Sep 22  2016 analytics-meta-bin.index
-rw-r----- 1 root root 140M Sep 22  2016 ibdata1
-rw-r----- 1 root root  50M Sep 22  2016 ib_logfile1
-rw-r----- 1 root root  50M Sep 22  2016 analytics-meta-bin.004647
-rw-r----- 1 root root  45M Sep 22  2016 analytics-meta-bin.004646
-rw-r----- 1 root root  52M Sep 22  2016 analytics-meta-bin.004645

Mentioned in SAL (#wikimedia-operations) [2017-12-19T10:47:29Z] <elukey> restart zookeeper on conf2001 for jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-19T10:53:45Z] <elukey> reboot conf2001 for kernel updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-20T17:32:29Z] <elukey> restart zookeeper on conf2002 for jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-20T17:43:17Z] <elukey> restart zookeeper on conf2003 for jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-21T09:20:21Z] <elukey> restart zookeeper on conf1001 for jvm updates - T179943

Mentioned in SAL (#wikimedia-operations) [2017-12-21T09:30:39Z] <elukey> restart zookeeper on conf100[2,3] for jvm updates - T179943

The remaining hosts to reboot/restart-jvm are the kafka102[0,2] brokers, but given what happened with kafka1018 (hw failure and OOW) we'll complete the work in January to avoid the risk of trouble during the holidays :)

We'll have to do another round of reboots probably next week, so the remaining kafka hosts will be done later on.

elukey set the point value for this task to 13. Jan 3 2018, 3:15 PM
elukey moved this task from Paused to Done on the Analytics-Kanban board.