
Bring Hadoop workers an-worker11[49-56] into service
Closed, Resolved · Public

Description

We have 8 new Hadoop workers to bring into service. These were racked in T327295: Q3:rack/setup/install an-worker11[49-56] and purchased in T325206.
For these we are going to:

  • Assign the servers the right partman recipe analytics-flex.cfg
  • Install Debian Bullseye on the hosts.
  • Run the hadoop-init-worker.py cookbook to set up the remaining partitions (see the command sketch after this list).
  • Create the servers' kerberos keytabs.
  • Add the hosts to role(analytics_cluster::hadoop::worker)
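
As a rough sketch of the partition-initialisation step above (the cookbook name is taken from this task; the exact invocation and arguments may differ):

# Illustrative only: cookbook name as referenced in this task; check the
# cookbook's help output for the real arguments before running it.
sudo cookbook sre.hadoop-init-workers 'an-worker11[49-56].eqiad.wmnet'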

Event Timeline

Change 946978 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Correct the role for the new hadoop workers

https://gerrit.wikimedia.org/r/946978

Change 946978 merged by Btullis:

[operations/puppet@production] Correct the role for the new hadoop workers

https://gerrit.wikimedia.org/r/946978

The disks were not mounted on these 8 hosts. I ran a megacli command on all hosts to fix the situation:

brouberol@cumin1001:~$ sudo cumin 'an-worker11[49-56].eqiad.wmnet' 'sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0'
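
As an optional follow-up (not part of the original log), the newly created single-disk RAID0 virtual drives can be confirmed to appear as block devices:

# Illustrative verification step, not taken from the original log.
sudo cumin 'an-worker11[49-56].eqiad.wmnet' 'lsblk'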

an-worker1149.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1150.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1151.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1152.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1153.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1154.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1155.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1156.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook
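
For reference, the "Create kerberos keytabs" step in the checklists above boils down to creating per-host service principals and exporting their keytabs. A minimal generic sketch with plain MIT Kerberos tooling (the production environment uses its own wrapper scripts on the KDC host; the principal names and the WIKIMEDIA realm are assumptions here):

# Illustrative only, shown for one host; the real keytabs are generated with
# site-specific tooling on the Kerberos admin host.
kadmin.local -q 'addprinc -randkey hdfs/an-worker1149.eqiad.wmnet@WIKIMEDIA'
kadmin.local -q 'addprinc -randkey HTTP/an-worker1149.eqiad.wmnet@WIKIMEDIA'
kadmin.local -q 'addprinc -randkey yarn/an-worker1149.eqiad.wmnet@WIKIMEDIA'
kadmin.local -q 'ktadd -k hdfs.keytab hdfs/an-worker1149.eqiad.wmnet@WIKIMEDIA'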

When everything is done:

  • Add the hosts to role(analytics_cluster::hadoop::worker)

Change 956785 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Register hadoop workers an-worker-1149->1156.eqiad.wmnet

https://gerrit.wikimedia.org/r/956785

Change 956789 had a related patch set uploaded (by Brouberol; author: Brouberol):

[labs/private@master] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers

https://gerrit.wikimedia.org/r/956789

~/wmf/private master ❯ mkdir -p modules/secret/secrets/kerberos/keytabs/an-worker11{49,50,51,52,53,54,55,56}.eqiad.wmnet/hadoop
~/wmf/private master ❯ touch modules/secret/secrets/kerberos/keytabs/an-worker11{49,50,51,52,53,54,55,56}.eqiad.wmnet/hadoop/{hdfs,HTTP,yarn}.keytab
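
A quick sanity check (illustrative, not from the original log) that the placeholder files exist for one host:

# Lists the dummy keytab files created by the commands above.
ls modules/secret/secrets/kerberos/keytabs/an-worker1149.eqiad.wmnet/hadoop/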

Change 956789 merged by Brouberol:

[labs/private@master] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers

https://gerrit.wikimedia.org/r/956789

Change 956785 merged by Brouberol:

[operations/puppet@production] Register hadoop workers an-worker-1149->1156.eqiad.wmnet

https://gerrit.wikimedia.org/r/956785

brouberol@puppetmaster1001:~$ sudo puppet-merge
...
Brouberol: Register hadoop workers an-worker-1149->1156.eqiad.wmnet (1d71a9e090)
Merge these changes? (yes/no)? yes

Mentioned in SAL (#wikimedia-analytics) [2023-09-12T11:21:25Z] <btullis> demonstrated the use of SAL for T343762

It took 2 puppet runs to get hadoop-hdfs-datanode.service running on an-worker1149. I've set an icinga downtime on the remaining 7 hosts to avoid systemd service failure alerts being reported to us during their initial puppet runs.

brouberol@cumin1001:~$ sudo cookbook sre.hosts.downtime -r 'Mute initial failures of hadoop-hdfs-datanode.service' 'an-worker115[0-6].eqiad.wmnet'


All 8 workers have registered with the cluster. We now need to rolling-restart the master nodes to assign the new workers to their proper (non-default) racks.

brouberol@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-masters analytics

All workers are provisioned, registered in Hadoop, and have their proper rack assignment. The system should rebalance HDFS blocks automatically.
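
If needed, the rebalancing can be followed with a standard HDFS admin report (illustrative; assumes valid hdfs superuser Kerberos credentials on a Hadoop client host):

# Illustrative: per-datanode capacity/usage report to watch blocks spread
# onto the new workers.
sudo -u hdfs hdfs dfsadmin -report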