Roll restart all openjdk-8 jvms in Analytics
Closed, Resolved | Public | 13 Estimated Story Points

Description

A new version of the openjdk-8 package is out, so we need another round of rolling restarts of all the Analytics JVM-based daemons. This time we should create a Spicerack cookbook for each cluster, so that the restarts can be automated.

  • Hadoop worker nodes
  • Hadoop master nodes
  • Hadoop coordinator
  • Hadoop worker nodes - test cluster
  • Hadoop master nodes - test cluster
  • Hadoop coordinator - test cluster
  • AQS nodes (aqs1004-1009)
  • Druid Private nodes (druid1001-3)
  • Druid Public nodes (druid1004-6)
  • Kafka Jumbo
  • Zookeeper conf100[4-6] and druid100[1-6]
  • /mnt/hdfs fuse mountpoints
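
Several entries above use bracket ranges such as conf100[4-6] and druid100[1-6]. As a rough illustration of how that notation expands into individual host names, here is a minimal standalone sketch. The function name `expand_hosts` is hypothetical, it handles only a single bracket group, and it is not the parser used by Cumin or Spicerack.

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand one bracket group, e.g. 'druid100[1-6]', into host names.

    Hypothetical helper for illustration only: supports a single
    [a-b,c] group with numeric ranges and keeps zero padding.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: the pattern is already a single host name.
        return [pattern]

    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        # "1031-1038" is a range, "1040" is a single value.
        start, _, end = part.partition("-")
        end = end or start
        width = len(start)
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    for name in expand_hosts("conf100[4-6]"):
        print(name)
```

The shorthand without brackets used for some entries (for example druid1001-3) is not covered by this sketch.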

Event Timeline

elukey triaged this task as Medium priority. Jul 25 2019, 2:29 PM
elukey created this task.

Change 525520 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] Add sre.hadoop.rolling-restart-workers cookbook

https://gerrit.wikimedia.org/r/525520

Change 525520 merged by Elukey:
[operations/cookbooks@master] Add sre.hadoop.rolling-restart-workers cookbook

https://gerrit.wikimedia.org/r/525520

Change 525573 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.rolling-restart-workers: fix bugs

https://gerrit.wikimedia.org/r/525573

Change 525573 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.rolling-restart-workers: fix bugs

https://gerrit.wikimedia.org/r/525573

Change 525575 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.rolling-restart-workers.py: fix argparse defaults

https://gerrit.wikimedia.org/r/525575

Change 525575 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.rolling-restart-workers.py: fix argparse defaults

https://gerrit.wikimedia.org/r/525575

Dry run for the hadoop test cluster:

elukey@cumin1001:~$ sudo cookbook -d sre.hadoop.rolling-restart-workers.py test --yarn-nm-batch-size 1 --hdfs-dn-batch-size 1
DRY-RUN: Executing cookbook sre.hadoop.rolling-restart-workers with args: ['test', '--yarn-nm-batch-size', '1', '--hdfs-dn-batch-size', '1']
DRY-RUN: START - Cookbook sre.hadoop.rolling-restart-workers
DRY-RUN: Resolved CNAME record for icinga.wikimedia.org: icinga.wikimedia.org. 300 IN CNAME icinga1001.wikimedia.org.
DRY-RUN: Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: analytics[1031-1038,1040].eqiad.wmnet
DRY-RUN: Executing commands ['icinga-downtime -h "analytics1031" -d 7200 -r "Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001"', 'icinga-downtime -h "analytics1032" -d 7200 -r "Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001"', 'icinga-downtime -h "analytics1033" -d 7200 -r "Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001"', 'icinga-downtime -h "analytics1034" -d 7200 -r "Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001"', 'icinga-downtime -h "analytics1035" -d 7200 -r "Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001"', 'icinga-downtime -h "analytics1036" -d 7200 -r "Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001"', 'icinga-downtime -h "analytics1037" -d 7200 -r "Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001"', 'icinga-downtime -h "analytics1038" -d 7200 -r "Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001"', 'icinga-downtime -h "analytics1040" -d 7200 -r "Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001"'] on 1 hosts: icinga1001.wikimedia.org
DRY-RUN: Restarting Yarn Nodemanagers with batch size 1 and sleep 30.0..
DRY-RUN: Executing commands ['systemctl restart hadoop-yarn-nodemanager'] on 9 hosts: analytics[1031-1038,1040].eqiad.wmnet
DRY-RUN: Restarting HDFS Datanodes with batch size 1 and sleep 30.0..
DRY-RUN: Executing commands ['systemctl restart hadoop-hdfs-datanode'] on 9 hosts: analytics[1031-1038,1040].eqiad.wmnet
DRY-RUN: Restarting HDFS Journalnodes with batch size 1 and sleep 30.0..
DRY-RUN: Executing commands ['systemctl restart hadoop-hdfs-journalnode'] on 3 hosts: analytics[1031,1034,1038].eqiad.wmnet
DRY-RUN: Executing commands ['awk \'/^\\s*command_file=/{split($0, a, "="); print a[2] }\' /etc/icinga/icinga.cfg'] on 1 hosts: icinga1001.wikimedia.org
DRY-RUN: Executing commands ['echo -n "[1564071441] DEL_DOWNTIME_BY_HOST_NAME;analytics1031" > /var/lib/icinga/rw/icinga.cmd', 'echo -n "[1564071441] DEL_DOWNTIME_BY_HOST_NAME;analytics1032" > /var/lib/icinga/rw/icinga.cmd', 'echo -n "[1564071441] DEL_DOWNTIME_BY_HOST_NAME;analytics1033" > /var/lib/icinga/rw/icinga.cmd', 'echo -n "[1564071441] DEL_DOWNTIME_BY_HOST_NAME;analytics1034" > /var/lib/icinga/rw/icinga.cmd', 'echo -n "[1564071441] DEL_DOWNTIME_BY_HOST_NAME;analytics1035" > /var/lib/icinga/rw/icinga.cmd', 'echo -n "[1564071441] DEL_DOWNTIME_BY_HOST_NAME;analytics1036" > /var/lib/icinga/rw/icinga.cmd', 'echo -n "[1564071441] DEL_DOWNTIME_BY_HOST_NAME;analytics1037" > /var/lib/icinga/rw/icinga.cmd', 'echo -n "[1564071441] DEL_DOWNTIME_BY_HOST_NAME;analytics1038" > /var/lib/icinga/rw/icinga.cmd', 'echo -n "[1564071441] DEL_DOWNTIME_BY_HOST_NAME;analytics1040" > /var/lib/icinga/rw/icinga.cmd'] on 1 hosts: icinga1001.wikimedia.org
DRY-RUN: All jvm restarts completed!
DRY-RUN: END (PASS) - Cookbook sre.hadoop.rolling-restart-workers (exit_code=0)

@Ottomata --^
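
The dry run above restarts each daemon type in batches of a given size and sleeps between batches. The pattern can be sketched roughly as follows. This is not the cookbook's actual code: `batches`, `rolling_restart`, and the `run` callback are hypothetical names, and remote execution is left to the caller instead of going through Spicerack or Cumin.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for index in range(0, len(hosts), size):
        yield list(hosts[index:index + size])


def rolling_restart(
    hosts: Sequence[str],
    unit: str,
    batch_size: int,
    batch_sleep: float,
    run: Callable[[List[str], str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Restart `unit` on `hosts` one batch at a time.

    `run` is a hypothetical callback that executes a command on a list
    of hosts. Returns the number of batches processed.
    """
    command = f"systemctl restart {unit}"
    count = 0
    for batch in batches(hosts, batch_size):
        run(batch, command)
        count += 1
        # Pause after every batch, including the last one, so the next
        # daemon type is not restarted immediately afterwards.
        sleep(batch_sleep)
    return count


if __name__ == "__main__":
    def dry_run(batch: List[str], command: str) -> None:
        print(f"DRY-RUN: would execute {command!r} on {batch}")

    test_hosts = ["analytics1031", "analytics1032", "analytics1033"]
    # Use a no-op sleep so the demonstration finishes immediately.
    rolling_restart(test_hosts, "hadoop-yarn-nodemanager", 1, 30.0,
                    dry_run, sleep=lambda seconds: None)
```

Passing `run` and `sleep` as parameters keeps the batching logic testable without touching any host, which matches the spirit of the dry-run mode shown in the log.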

Change 525775 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add more cumin aliases for Druid clusters

https://gerrit.wikimedia.org/r/525775

Change 525775 merged by Elukey:
[operations/puppet@production] Add more cumin aliases for Druid clusters

https://gerrit.wikimedia.org/r/525775

Change 525776 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] Add sre.druid.roll-restart-workers.py

https://gerrit.wikimedia.org/r/525776

Change 525804 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] Add sre.kafka.roll-restart-brokers.py

https://gerrit.wikimedia.org/r/525804

Change 526014 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.roll-restart-workers.py: increase sleep time for HDFS

https://gerrit.wikimedia.org/r/526014

Change 526014 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.roll-restart-workers.py: increase sleep time for HDFS

https://gerrit.wikimedia.org/r/526014

All 54 Hadoop worker nodes roll restarted with the sre.hadoop.roll-restart-workers.py cookbook!

Change 525776 merged by Elukey:
[operations/cookbooks@master] Add sre.druid.roll-restart-workers.py

https://gerrit.wikimedia.org/r/525776

Change 526111 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.roll-restart-workers.py: ensure durability of the shell

https://gerrit.wikimedia.org/r/526111

Change 526111 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.roll-restart-workers.py: ensure durability of the shell

https://gerrit.wikimedia.org/r/526111

Change 526122 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: add conftool config/scripts

https://gerrit.wikimedia.org/r/526122

Change 526122 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: add conftool config/scripts

https://gerrit.wikimedia.org/r/526122

Change 525804 merged by Elukey:
[operations/cookbooks@master] Add sre.kafka.roll-restart-brokers.py

https://gerrit.wikimedia.org/r/525804

Change 526376 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.kafka.roll-restart-brokers.py: improvements to the procedure

https://gerrit.wikimedia.org/r/526376

Change 526376 merged by Elukey:
[operations/cookbooks@master] sre.kafka.roll-restart-brokers.py: improvements to the procedure

https://gerrit.wikimedia.org/r/526376

Change 526472 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.kafka.roll-restart-brokers: source /etc/profile.d/kafka.sh

https://gerrit.wikimedia.org/r/526472

Change 526472 merged by Elukey:
[operations/cookbooks@master] sre.kafka.roll-restart-brokers: source /etc/profile.d/kafka.sh

https://gerrit.wikimedia.org/r/526472

Change 526620 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add cumin aliases for Kafka Mirror Maker

https://gerrit.wikimedia.org/r/526620

Change 526624 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] Add sre.kafka.roll-restart-mirror-maker.py

https://gerrit.wikimedia.org/r/526624

Change 526628 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] Add sre.zookeeper.roll-restart-zookeeper.py

https://gerrit.wikimedia.org/r/526628

Change 526620 merged by Elukey:
[operations/puppet@production] Add cumin aliases for Kafka Mirror Maker and zookeeper

https://gerrit.wikimedia.org/r/526620

Change 526624 merged by Elukey:
[operations/cookbooks@master] Add sre.kafka.roll-restart-mirror-maker.py

https://gerrit.wikimedia.org/r/526624

Change 526628 merged by Elukey:
[operations/cookbooks@master] Add sre.zookeeper.roll-restart-zookeeper.py

https://gerrit.wikimedia.org/r/526628

Change 526665 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] Improvements to kafka and zookeeper cookbooks

https://gerrit.wikimedia.org/r/526665

Change 526665 merged by Elukey:
[operations/cookbooks@master] Improvements to kafka and zookeeper cookbooks

https://gerrit.wikimedia.org/r/526665

elukey updated the task description.

Change 528127 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add PartOf configuration in the Kafka mirror systemd units

https://gerrit.wikimedia.org/r/528127

Change 528127 merged by Elukey:
[operations/puppet@production] Add PartOf configuration in the Kafka mirror systemd units

https://gerrit.wikimedia.org/r/528127

elukey changed the point value for this task from 0 to 13.
elukey moved this task from In Progress to Done on the Analytics-Kanban board.