Reboot Analytics hosts for kernel security upgrades
Closed, Resolved | Public | 8 Estimated Story Points

Description

Next round of reboots for the roll out of a new kernel version:

  • Hadoop worker nodes - analytics10[28-77]
  • Hadoop master nodes - analytics100[12] (soon to be replaced with analytics-master100[12])
  • Hadoop coordinator - analytics1003
  • AQS nodes (aqs1001-1009)
  • Druid Private nodes (druid1001-3)
  • Druid Public nodes (druid1004-6)
  • Kafka Jumbo
  • Kafka main codfw
  • Kafka main eqiad
  • Kafka Analytics
  • stat100[4-6] hosts
  • notebook100[3,4]
  • conf100[4-6]
  • db110[7,8]
  • eventlog1002
  • dbstore1002 (will not be rebooted: it still runs the old Trusty release and will be replaced soon)

New hosts will follow as soon as the new kernel is deployed.

Before rebooting, please run the following checks to ensure that PXE is not the preferred boot option (so that a reboot will not trigger a reimage):

  • ipmitool -I lanplus -H "HOSTNAME" -U root -E chassis bootparam get 5 | awk '{ FS=":" }; /(Boot parameter data|Boot Device Selector)/{ print $2 }' (should be all zeros - to check)
  • In the mgmt console, racadm get bios.BiosBootSettings.BootSeq should return BootSeq=HardDisk.List.1-1,NIC.Integrated.1-1-1.
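
The two checks above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used for these reboots: the function names `boot_override_clear` and `bootseq_disk_first` are hypothetical, and the code assumes that ipmitool prints a `Boot parameter data: <hex digits>` line and that racadm prints a `BootSeq=<device list>` line. It also treats "the first BootSeq entry starts with HardDisk" as the passing condition, which is an interpretation of the check rather than something stated in the task.

```shell
#!/bin/sh
# Sketch only: helpers that read command output on stdin and report, via
# their exit status, whether a pre-reboot check passes. The function names
# are hypothetical and the accepted output formats are assumptions.

# Exit 0 only if a "Boot parameter data" line is present and its value is
# all zeros (assumed to mean that no boot override such as PXE is set).
boot_override_clear() {
    awk -F: '
        /Boot parameter data/ {
            seen = 1
            value = $2
            gsub(/[ \t\r]/, "", value)      # drop padding around the value
            if (value !~ /^0+$/) bad = 1
        }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}

# Exit 0 only if a "BootSeq=" line is present and its first device is a
# hard disk entry (assumed to mean that the NIC/PXE is not tried first).
bootseq_disk_first() {
    awk '
        /BootSeq=/ {
            seen = 1
            seq = $0
            sub(/.*BootSeq=/, "", seq)      # keep only the device list
            if (seq !~ /^HardDisk/) bad = 1
        }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}
```

With these defined, `ipmitool -I lanplus -H "HOSTNAME" -U root -E chassis bootparam get 5 | boot_override_clear && echo "no boot override"` would print the message only when the data field is all zeros. Both helpers fail when the expected line is missing, so an unexpected output format is reported as a failed check rather than silently passing.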

Event Timeline

elukey triaged this task as High priority. Aug 30 2018, 2:43 PM
elukey created this task.

@elukey

Please: is the cluster reboot, planned for September 10 (today), finished?

I really need to run a bunch of Hive jobs (they take, well, many hours to complete), and I need to make sure these jobs will not be interrupted once started.

Thanks.

@elukey Thanks for the update (e-mail) on the cluster reboot.

Please: you will not be rebooting stat100[4-6] hosts in the following 30 hours or so? Please say you won't. Please.

I won't! I was planning to send an email for stat* and notebook* reboots for Thursday EU morning time, but if this impacts your work we can of course choose another time :)

@elukey Thanks!

Thursday EU morning time should be fine with me: I am running a huge update of the Wikidata Concepts Monitor from stat1005.
It should not take more than 10 hours; however, I have to make sure that everything runs smoothly and re-run some ETL procedures if it does not.

Mentioned in SAL (#wikimedia-operations) [2018-09-13T06:41:59Z] <elukey> reboot stat100[4-6] for kernel upgrades - T203165

Mentioned in SAL (#wikimedia-operations) [2018-09-13T06:56:11Z] <elukey> reboot notebook100[3,4] for kernel upgrades - T203165

elukey set the point value for this task to 8.
elukey moved this task from In Progress to Done on the Analytics-Kanban board.