
Bring Hadoop workers an-worker11[49-56] into service
Closed, Resolved · Public

Description

We have 8 new Hadoop workers to bring into service. These were racked in T327295: Q3:rack/setup/install an-worker11[49-56] and purchased in T325206.
For these we are going to:

  • Assign the servers the right partman recipe analytics-flex.cfg
  • Install Debian Bullseye on the hosts.
  • Run the hadoop-init-worker.py cookbook to set up the remaining partitions (see the command sketch after this list).
  • Create the servers' kerberos keytabs.
  • Add the hosts to role(analytics_cluster::hadoop::worker)
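
As a rough sketch of the partition-initialisation step above (the cookbook name is taken from this task; the exact invocation and arguments may differ):

# Illustrative only: cookbook name as referenced in this task; check the
# cookbook's help output for the real arguments before running it.
sudo cookbook sre.hadoop-init-workers 'an-worker11[49-56].eqiad.wmnet'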

Event Timeline

Change 946978 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Correct the role for the new hadoop workers

https://gerrit.wikimedia.org/r/946978

Change 946978 merged by Btullis:

[operations/puppet@production] Correct the role for the new hadoop workers

https://gerrit.wikimedia.org/r/946978

The disks were not mounted on these 8 hosts. I ran a megacli command on all hosts to fix the situation:

brouberol@cumin1001:~$ sudo cumin 'an-worker11[49-56].eqiad.wmnet' 'sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0'
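
As an optional follow-up (not part of the original log), the newly created single-disk RAID0 virtual drives can be confirmed to appear as block devices:

# Illustrative verification step, not taken from the original log.
sudo cumin 'an-worker11[49-56].eqiad.wmnet' 'lsblk'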

an-worker1149.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1150.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1151.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1152.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1153.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1154.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1155.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook

an-worker1156.eqiad.wmnet

  • Setup journal node
  • Create kerberos keytabs
  • Commit kerberos keytabs in puppet
  • Run sre.hadoop-init-workers cookbook
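
For reference, the "Create kerberos keytabs" step in the checklists above boils down to creating per-host service principals and exporting their keytabs. A minimal generic sketch with plain MIT Kerberos tooling (the production environment uses its own wrapper scripts on the KDC host; the principal names and the WIKIMEDIA realm are assumptions here):

# Illustrative only, shown for one host; the real keytabs are generated with
# site-specific tooling on the Kerberos admin host.
kadmin.local -q 'addprinc -randkey hdfs/an-worker1149.eqiad.wmnet@WIKIMEDIA'
kadmin.local -q 'addprinc -randkey HTTP/an-worker1149.eqiad.wmnet@WIKIMEDIA'
kadmin.local -q 'addprinc -randkey yarn/an-worker1149.eqiad.wmnet@WIKIMEDIA'
kadmin.local -q 'ktadd -k hdfs.keytab hdfs/an-worker1149.eqiad.wmnet@WIKIMEDIA'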

When everything is done:

  • Add the hosts to role(analytics_cluster::hadoop::worker)

Change 956785 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Register hadoop workers an-worker-1149->1156.eqiad.wmnet

https://gerrit.wikimedia.org/r/956785

Change 956789 had a related patch set uploaded (by Brouberol; author: Brouberol):

[labs/private@master] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers

https://gerrit.wikimedia.org/r/956789

~/wmf/private master ❯ mkdir -p modules/secret/secrets/kerberos/keytabs/an-worker11{49,50,51,52,53,54,55,56}.eqiad.wmnet/hadoop
~/wmf/private master ❯ touch modules/secret/secrets/kerberos/keytabs/an-worker11{49,50,51,52,53,54,55,56}.eqiad.wmnet/hadoop/{hdfs,HTTP,yarn}.keytab
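
A quick sanity check (illustrative, not from the original log) that the placeholder files exist for one host:

# Lists the dummy keytab files created by the commands above.
ls modules/secret/secrets/kerberos/keytabs/an-worker1149.eqiad.wmnet/hadoop/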

Change 956789 merged by Brouberol:

[labs/private@master] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers

https://gerrit.wikimedia.org/r/956789

Change 956785 merged by Brouberol:

[operations/puppet@production] Register hadoop workers an-worker-1149->1156.eqiad.wmnet

https://gerrit.wikimedia.org/r/956785

brouberol@puppetmaster1001:~$ sudo puppet-merge
...
Brouberol: Register hadoop workers an-worker-1149->1156.eqiad.wmnet (1d71a9e090)
Merge these changes? (yes/no)? yes

Mentioned in SAL (#wikimedia-analytics) [2023-09-12T11:21:25Z] <btullis> demonstrated the use of SAL for T343762

It took 2 puppet runs to get hadoop-hdfs-datanode.service running on an-worker1149. I've set an icinga downtime on the remaining 7 hosts to avoid systemd service failure alerts being reported to us during their initial puppet runs.

brouberol@cumin1001:~$ sudo cookbook sre.hosts.downtime -r 'Mute initial failures of hadoop-hdfs-datanode.service' 'an-worker115[0-6].eqiad.wmnet'


All 8 workers have registered with the cluster. We now need to rolling-restart the master nodes to assign the new workers to their proper (non-default) racks.

brouberol@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-masters analytics

All workers are provisioned, registered in Hadoop, and have their proper rack assignment. The system should rebalance HDFS blocks automatically.
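
If needed, the rebalancing can be followed with a standard HDFS admin report (illustrative; assumes valid hdfs superuser Kerberos credentials on a Hadoop client host):

# Illustrative: per-datanode capacity/usage report to watch blocks spread
# onto the new workers.
sudo -u hdfs hdfs dfsadmin -report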