
Create a temporary hadoop backup cluster
Closed, Resolved · Public · 13 Estimated Story Points

Description

Once T260409 is done, we should have a good idea of how much space on HDFS we'd need to back up the data that we care about.

The hardware for the workers and the master/standby nodes should come from the refresh of the Hadoop Analytics worker nodes (the SRE team will allow us to keep them around for a little while longer).

Caveat: before using the refreshed nodes, we'd need to decommission the ones that are currently forming the Hadoop test cluster.

Event Timeline

See T260409: Establish what data must be backed up before the HDFS upgrade. The list of datasets to back up should probably be consolidated into a Google doc or wiki page, where we can update it more easily than on a Phabricator ticket.

Change 632878 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove an-worker1043 from the Hadoop workers

https://gerrit.wikimedia.org/r/632878

Change 632878 merged by Elukey:
[operations/puppet@production] Remove analytics1043 from the Hadoop workers

https://gerrit.wikimedia.org/r/632878

Change 635751 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Initial configuration of the Hadoop backup cluster

https://gerrit.wikimedia.org/r/635751

Change 636403 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.init-hadoop-workers: add more defensive code

https://gerrit.wikimedia.org/r/636403

Change 636403 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.init-hadoop-workers: add more defensive code

https://gerrit.wikimedia.org/r/636403

All the nodes (analytics1042 -> 1057) have new ext4 partitions for /var/lib/hadoop/data/$letter.
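
For reference, this is roughly what the per-disk setup looks like for a single datanode disk; the device name and label below are illustrative, and the actual work was done via the sre.hadoop.init-hadoop-workers cookbook:

~ # parted --script /dev/sdc mklabel gpt mkpart primary ext4 0% 100%
~ # mkfs.ext4 -L hadoop-b /dev/sdc1
~ # mkdir -p /var/lib/hadoop/data/b
~ # echo 'LABEL=hadoop-b /var/lib/hadoop/data/b ext4 defaults,noatime 0 2' >> /etc/fstab
~ # mount /var/lib/hadoop/data/b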

Next steps:

  1. Reimage all the nodes (keeping Debian Stretch)
  2. Review https://gerrit.wikimedia.org/r/635751
  3. Create a follow up patch to deploy roles to hosts
  4. Bootstrap the cluster (see the rough sketch below)

@razzi is already working on 2), then we'll do the rest together :)
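
For step 4, a very rough sketch of what bootstrapping an HA HDFS cluster typically involves upstream; the exact procedure for our puppetized setup differs, this is only for orientation:

~ # sudo -u hdfs hdfs namenode -format              # on the master, with the JournalNodes already running
~ # sudo -u hdfs hdfs zkfc -formatZK                # only if ZooKeeper-based automatic failover is used
~ # sudo -u hdfs hdfs namenode -bootstrapStandby    # on the standby master, then start the daemons via systemd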

elukey triaged this task as High priority.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1042.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010280849_elukey_19135.log.

Completed auto-reimage of hosts:

['analytics1042.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1043.eqiad.wmnet', 'analytics1044.eqiad.wmnet', 'analytics1045.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010280937_elukey_1633.log.

Completed auto-reimage of hosts:

['analytics1045.eqiad.wmnet', 'analytics1043.eqiad.wmnet']

Of which those FAILED:

['analytics1044.eqiad.wmnet']

analytics1044 seems to keep PXE booting, so it endlessly reinstalls the OS. I checked the system setup (reboot + F2) and the hard disk is configured before the NIC in the boot order (as expected), so I'm not sure what's wrong.

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

['analytics1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010291915_razzi_1269.log.

Completed auto-reimage of hosts:

['analytics1046.eqiad.wmnet']

Of which those FAILED:

['analytics1046.eqiad.wmnet']

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

['analytics1047.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010292155_razzi_27451.log.

Completed auto-reimage of hosts:

['analytics1047.eqiad.wmnet']

and were ALL successful.

Change 637607 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add role::analytics_test_cluster::hadoop::ui to an-test-ui1001

https://gerrit.wikimedia.org/r/637607

Change 637607 merged by Elukey:
[operations/puppet@production] Add role::analytics_test_cluster::hadoop::ui to an-test-ui1001

https://gerrit.wikimedia.org/r/637607

analytics1044 is in an endless PXE install loop, and it is not due to the NIC being ahead of the HDD in the boot order (already checked), but because for some reason /dev/sda is not the RAID1 in the flexbay (~250GB) but one of the 12 datanode disks (4TB each):

~ # dmesg | grep sda
[   51.285328] sd 0:0:2:0: [sda] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[   51.367087] sd 0:0:2:0: [sda] Write Protect is off
[   51.367092] sd 0:0:2:0: [sda] Mode Sense: 9b 00 10 08
[   51.367792] sd 0:0:2:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
[   51.458325]  sda: sda1 sda2
[   51.534583] sd 0:0:2:0: [sda] Attached SCSI disk
[  104.306173] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
~ # dmesg | grep sdb
[   51.284629] sd 0:2:0:0: [sdb] 487325696 512-byte logical blocks: (250 GB/232 GiB)
[   51.284697] sd 0:2:0:0: [sdb] Write Protect is off
[   51.284701] sd 0:2:0:0: [sdb] Mode Sense: 1f 00 00 08
[   51.284747] sd 0:2:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   51.323847]  sdb:
[   51.324164] sd 0:2:0:0: [sdb] Attached SCSI disk

So the OS gets installed on a datanode disk, and when the host then tries to boot there is nothing on the flexbay RAID1 disk, so PXE gets selected again. It seems to be an issue with the d-i (debian-installer) disk selection logic itself, which is really strange.
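
A quick way to double-check which device the installer/OS picks as sda versus the flexbay RAID1 (illustrative commands, run from the d-i shell or a rescue environment):

~ # lsblk -d -o NAME,SIZE,MODEL,ROTA   # the ~250GB flexbay volume is easy to spot next to the 4TB datanode disks
~ # ls -l /dev/disk/by-path/           # confirms which controller/slot each block device hangs off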

analytics1046 doesn't pass the first boot checks and seems really stuck; no idea whether it is dead for good (e.g. motherboard gone) or whether there is anything we can do to unblock it.

I was able to fix 1044: the problem was a broken disk that was not configured properly in the Dell RAID controller setup. 1046 is still not working :(

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1048.eqiad.wmnet', 'analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011060900_elukey_20692.log.

Completed auto-reimage of hosts:

['analytics1048.eqiad.wmnet']

Of which those FAILED:

['analytics1050.eqiad.wmnet', 'analytics1049.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011061035_elukey_18556.log.

Completed auto-reimage of hosts:

['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1052.eqiad.wmnet', 'analytics1051.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011061106_elukey_18534.log.

Completed auto-reimage of hosts:

['analytics1051.eqiad.wmnet', 'analytics1052.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1053.eqiad.wmnet', 'analytics1054.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011061136_elukey_17365.log.

Completed auto-reimage of hosts:

['analytics1054.eqiad.wmnet', 'analytics1053.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1055.eqiad.wmnet', 'analytics1056.eqiad.wmnet', 'analytics1057.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011061349_elukey_14427.log.

Completed auto-reimage of hosts:

['analytics1057.eqiad.wmnet']

Of which those FAILED:

['analytics1057.eqiad.wmnet']

After a full round of reimages, only 1046 and 1057 are unavailable, since they no longer boot. Let's see if DC Ops can help in https://phabricator.wikimedia.org/T267392

Sadly 1046 and 1057 need to be decommissioned. At this point, with only 14 "old" workers remaining (not sufficient for our purposes), I think it is better to just decommission all of them (to free space in the DC) and build the backup cluster from the new Hadoop worker nodes only (more reliability, fewer issues, etc.).

@Ottomata @razzi ok with the plan?

Change 635751 merged by Elukey:
[operations/puppet@production] Initial configuration of the Hadoop backup cluster

https://gerrit.wikimedia.org/r/635751

Change 657769 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add roles to the Hadoop Backup cluster nodes

https://gerrit.wikimedia.org/r/657769

Change 657774 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move hiera config for Hadoop Backup to the correct location

https://gerrit.wikimedia.org/r/657774

Change 657774 merged by Elukey:
[operations/puppet@production] Move hiera config for Hadoop Backup to the correct location

https://gerrit.wikimedia.org/r/657774

Change 657784 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::worker: make client tools optional

https://gerrit.wikimedia.org/r/657784

Change 657784 merged by Elukey:
[operations/puppet@production] profile::hadoop::worker: make client tools optional

https://gerrit.wikimedia.org/r/657784

Change 657769 merged by Elukey:
[operations/puppet@production] Add roles to the Hadoop Backup cluster nodes

https://gerrit.wikimedia.org/r/657769

Change 657805 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: make Yarn Spark Shuffle optional

https://gerrit.wikimedia.org/r/657805

Change 657805 merged by Elukey:
[operations/puppet@production] hadoop: make Yarn Spark Shuffle optional

https://gerrit.wikimedia.org/r/657805

Change 657810 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::analytics: add metrics for the Backup cluster

https://gerrit.wikimedia.org/r/657810

Change 657810 merged by Elukey:
[operations/puppet@production] profile::prometheus::analytics: add metrics for the Backup cluster

https://gerrit.wikimedia.org/r/657810

The cluster is up and running, together with metrics, etc.

The current setup is:

  • two master nodes (an-worker1118 and an-worker1124)
  • 14 worker nodes, for a total of 560TB of free space
  • 2 worker nodes, pending DC Ops, will be added (+96TB)

This is still not enough for our needs (400TB x 2 replicas = 800TB), so we'll either need to add a DataNode daemon on the masters (+96TB) or ask DC Ops to rack more nodes :(
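
Quick back-of-the-envelope check of the figures above (all numbers taken from this comment):

~ # echo $((400 * 2))        # needed: 400TB of data at replication factor 2
800
~ # echo $((560 + 96))       # available: 14 workers (560TB) + 2 pending nodes (96TB)
656
~ # echo $((560 + 96 + 96))  # same, plus DataNode daemons on the two masters
752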

Change 658098 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add the Hadoop worker profile to master/standby in Backup

https://gerrit.wikimedia.org/r/658098

Change 658098 merged by Elukey:
[operations/puppet@production] Add the Hadoop worker profile to master/standby in Backup

https://gerrit.wikimedia.org/r/658098

Change 658215 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add the HDFS balancer to the Master node in Hadoop backup

https://gerrit.wikimedia.org/r/658215

Change 658215 merged by Elukey:
[operations/puppet@production] Add the HDFS balancer to the Master node in Hadoop backup

https://gerrit.wikimedia.org/r/658215
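
For reference, the balancer deployed by this change can also be run by hand; the threshold value below is just an example:

~ # sudo -u hdfs hdfs balancer -threshold 10   # move blocks until every DataNode is within 10% of the cluster average utilization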

Change 658219 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add a more restrictive default umask to Hadoop backup

https://gerrit.wikimedia.org/r/658219

Change 658219 merged by Elukey:
[operations/puppet@production] Add a more restrictive default umask to Hadoop backup

https://gerrit.wikimedia.org/r/658219
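
Once the change is applied, the effective default umask can be verified from any cluster node (the property name below is the standard Hadoop key):

~ # hdfs getconf -confKey fs.permissions.umask-mode   # cluster-wide default umask for new HDFS files/directories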

Change 658394 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add more users to the Hadoop Backup cluster (no ssh access)

https://gerrit.wikimedia.org/r/658394

Change 658394 merged by Elukey:
[operations/puppet@production] Add more users to the Hadoop Backup cluster (no ssh access)

https://gerrit.wikimedia.org/r/658394

Change 658553 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-worker1119 and 1131 to the Hadoop backup cluster

https://gerrit.wikimedia.org/r/658553

Change 658553 merged by Elukey:
[operations/puppet@production] Add an-worker1119 and 1131 to the Hadoop backup cluster

https://gerrit.wikimedia.org/r/658553

The last two nodes have been added to the cluster, so in theory it is ready to go. I'll leave this task open for a bit in case anything else is needed :)
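
A couple of quick checks that the cluster is actually ready (to be run on one of the masters):

~ # sudo -u hdfs hdfs dfsadmin -report | head -n 10   # total/remaining capacity and number of live DataNodes
~ # sudo -u hdfs hdfs dfs -ls /                       # basic smoke test of the filesystem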

Change 661051 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission an-worker1117 from the Hadoop cluster

https://gerrit.wikimedia.org/r/661051

Change 661051 merged by Elukey:
[operations/puppet@production] Decommission an-worker1117 from the Hadoop cluster

https://gerrit.wikimedia.org/r/661051

elukey set the point value for this task to 13. (Feb 11 2021, 11:12 AM)
elukey added a project: Analytics-Kanban.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.