
Create a temporary hadoop backup cluster
Closed, Resolved · Public · 13 Estimated Story Points

Description

Once T260409 is done, we should have a good idea of how much space on HDFS we'd need to back up the data that we care about.

The hardware for the workers and the master/standby nodes should come from the refresh of the Hadoop Analytics worker nodes (the SRE team will allow us to keep them around for a little while longer).

Caveat: before using the refreshed nodes, we'd need to decommission the ones that are currently forming the Hadoop test cluster.

Event Timeline

See T260409: Establish what data must be backed up before the HDFS upgrade. The list of datasets to back up should probably be consolidated into a Google doc or wiki page, where we can update it more easily than on a Phabricator ticket.

Change 632878 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove an-worker1043 from the Hadoop workers

https://gerrit.wikimedia.org/r/632878

Change 632878 merged by Elukey:
[operations/puppet@production] Remove analytics1043 from the Hadoop workers

https://gerrit.wikimedia.org/r/632878

Change 635751 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Initial configuration of the Hadoop backup cluster

https://gerrit.wikimedia.org/r/635751

Change 636403 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.init-hadoop-workers: add more defensive code

https://gerrit.wikimedia.org/r/636403

Change 636403 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.init-hadoop-workers: add more defensive code

https://gerrit.wikimedia.org/r/636403

All the nodes (analytics1042 -> 1057) have new ext4 partitions for /var/lib/hadoop/data/$letter.
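
For reference, this is roughly what the per-disk setup looks like for a single datanode disk; the device name and label below are illustrative, and the actual work was done via the sre.hadoop.init-hadoop-workers cookbook:

~ # parted --script /dev/sdc mklabel gpt mkpart primary ext4 0% 100%
~ # mkfs.ext4 -L hadoop-b /dev/sdc1
~ # mkdir -p /var/lib/hadoop/data/b
~ # echo 'LABEL=hadoop-b /var/lib/hadoop/data/b ext4 defaults,noatime 0 2' >> /etc/fstab
~ # mount /var/lib/hadoop/data/b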

Next steps:

  1. Reimage all the nodes (keeping Debian Stretch)
  2. Review https://gerrit.wikimedia.org/r/635751
  3. Create a follow up patch to deploy roles to hosts
  4. Bootstrap the cluster (see the rough sketch below)

@razzi is already working on 2), then we'll do the rest together :)
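
For step 4, a very rough sketch of what bootstrapping an HA HDFS cluster typically involves upstream; the exact procedure for our puppetized setup differs, this is only for orientation:

~ # sudo -u hdfs hdfs namenode -format              # on the master, with the JournalNodes already running
~ # sudo -u hdfs hdfs zkfc -formatZK                # only if ZooKeeper-based automatic failover is used
~ # sudo -u hdfs hdfs namenode -bootstrapStandby    # on the standby master, then start the daemons via systemd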

elukey triaged this task as High priority.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1042.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010280849_elukey_19135.log.

Completed auto-reimage of hosts:

['analytics1042.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1043.eqiad.wmnet', 'analytics1044.eqiad.wmnet', 'analytics1045.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010280937_elukey_1633.log.

Completed auto-reimage of hosts:

['analytics1045.eqiad.wmnet', 'analytics1043.eqiad.wmnet']

Of which those FAILED:

['analytics1044.eqiad.wmnet']

analytics1044 seems to keep PXE booting, so it endlessly reinstalls the OS. I checked the system setup (reboot + F2) and the hard disk is configured before the NIC in the boot order (as expected), so I'm not sure what's wrong.

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

['analytics1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010291915_razzi_1269.log.

Completed auto-reimage of hosts:

['analytics1046.eqiad.wmnet']

Of which those FAILED:

['analytics1046.eqiad.wmnet']

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

['analytics1047.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010292155_razzi_27451.log.

Completed auto-reimage of hosts:

['analytics1047.eqiad.wmnet']

and were ALL successful.

Change 637607 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add role::analytics_test_cluster::hadoop::ui to an-test-ui1001

https://gerrit.wikimedia.org/r/637607

Change 637607 merged by Elukey:
[operations/puppet@production] Add role::analytics_test_cluster::hadoop::ui to an-test-ui1001

https://gerrit.wikimedia.org/r/637607

analytics1044 is in an endless PXE install loop, and it is not due to the NIC being ahead of the HDD in the boot order (already checked), but because for some reason /dev/sda is not the RAID1 in the flexbay (~250GB) but one of the 12 datanode disks (4TB each):

~ # dmesg | grep sda
[   51.285328] sd 0:0:2:0: [sda] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[   51.367087] sd 0:0:2:0: [sda] Write Protect is off
[   51.367092] sd 0:0:2:0: [sda] Mode Sense: 9b 00 10 08
[   51.367792] sd 0:0:2:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
[   51.458325]  sda: sda1 sda2
[   51.534583] sd 0:0:2:0: [sda] Attached SCSI disk
[  104.306173] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
~ # dmesg | grep sdb
[   51.284629] sd 0:2:0:0: [sdb] 487325696 512-byte logical blocks: (250 GB/232 GiB)
[   51.284697] sd 0:2:0:0: [sdb] Write Protect is off
[   51.284701] sd 0:2:0:0: [sdb] Mode Sense: 1f 00 00 08
[   51.284747] sd 0:2:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   51.323847]  sdb:
[   51.324164] sd 0:2:0:0: [sdb] Attached SCSI disk

So the OS gets installed on a datanode disk, and when the host then tries to boot there is nothing on the flexbay RAID1 disk, so PXE gets selected again. It seems to be an issue with the d-i (debian-installer) disk selection logic itself, which is really strange.
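
A quick way to double-check which device the installer/OS picks as sda versus the flexbay RAID1 (illustrative commands, run from the d-i shell or a rescue environment):

~ # lsblk -d -o NAME,SIZE,MODEL,ROTA   # the ~250GB flexbay volume is easy to spot next to the 4TB datanode disks
~ # ls -l /dev/disk/by-path/           # confirms which controller/slot each block device hangs off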

analytics1046 doesn't pass the first boot checks and seems really stuck; no idea whether it is dead for good (e.g. motherboard gone) or whether there is anything we can do to unblock it.

I was able to fix 1044: the problem was a broken disk that was not configured properly in the Dell RAID controller setup. 1046 is still not working :(

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1048.eqiad.wmnet', 'analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011060900_elukey_20692.log.

Completed auto-reimage of hosts:

['analytics1048.eqiad.wmnet']

Of which those FAILED:

['analytics1050.eqiad.wmnet', 'analytics1049.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011061035_elukey_18556.log.

Completed auto-reimage of hosts:

['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1052.eqiad.wmnet', 'analytics1051.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011061106_elukey_18534.log.

Completed auto-reimage of hosts:

['analytics1051.eqiad.wmnet', 'analytics1052.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1053.eqiad.wmnet', 'analytics1054.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011061136_elukey_17365.log.

Completed auto-reimage of hosts:

['analytics1054.eqiad.wmnet', 'analytics1053.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1055.eqiad.wmnet', 'analytics1056.eqiad.wmnet', 'analytics1057.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202011061349_elukey_14427.log.

Completed auto-reimage of hosts:

['analytics1057.eqiad.wmnet']

Of which those FAILED:

['analytics1057.eqiad.wmnet']

After a full round of reimages, only 1046 and 1057 are unavailable, since they no longer boot. Let's see if DC Ops can help in https://phabricator.wikimedia.org/T267392

Sadly 1046 and 1057 need to be decommissioned. At this point, with only 14 "old" workers remaining (not sufficient for our purposes), I think it is better to just decommission all of them (to free space in the DC) and build the backup cluster from the new Hadoop worker nodes only (more reliability, fewer issues, etc.).

@Ottomata @razzi ok with the plan?

Change 635751 merged by Elukey:
[operations/puppet@production] Initial configuration of the Hadoop backup cluster

https://gerrit.wikimedia.org/r/635751

Change 657769 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add roles to the Hadoop Backup cluster nodes

https://gerrit.wikimedia.org/r/657769

Change 657774 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move hiera config for Hadoop Backup to the correct location

https://gerrit.wikimedia.org/r/657774

Change 657774 merged by Elukey:
[operations/puppet@production] Move hiera config for Hadoop Backup to the correct location

https://gerrit.wikimedia.org/r/657774

Change 657784 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::worker: make client tools optional

https://gerrit.wikimedia.org/r/657784

Change 657784 merged by Elukey:
[operations/puppet@production] profile::hadoop::worker: make client tools optional

https://gerrit.wikimedia.org/r/657784

Change 657769 merged by Elukey:
[operations/puppet@production] Add roles to the Hadoop Backup cluster nodes

https://gerrit.wikimedia.org/r/657769

Change 657805 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: make Yarn Spark Shuffle optional

https://gerrit.wikimedia.org/r/657805

Change 657805 merged by Elukey:
[operations/puppet@production] hadoop: make Yarn Spark Shuffle optional

https://gerrit.wikimedia.org/r/657805

Change 657810 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::analytics: add metrics for the Backup cluster

https://gerrit.wikimedia.org/r/657810

Change 657810 merged by Elukey:
[operations/puppet@production] profile::prometheus::analytics: add metrics for the Backup cluster

https://gerrit.wikimedia.org/r/657810

The cluster is up and running, together with metrics, etc.

The current setup is:

  • two master nodes (an-worker1118 and an-worker1124)
  • 14 worker nodes, for a total of 560TB of free space
  • 2 worker nodes, pending DC Ops, will be added (+96TB)

This is still not enough for our needs (400TB x 2 replicas = 800TB), so we'll either need to add a DataNode daemon on the masters (+96TB) or ask DC Ops to rack more nodes :(
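
Quick back-of-the-envelope check of the figures above (all numbers taken from this comment):

~ # echo $((400 * 2))        # needed: 400TB of data at replication factor 2
800
~ # echo $((560 + 96))       # available: 14 workers (560TB) + 2 pending nodes (96TB)
656
~ # echo $((560 + 96 + 96))  # same, plus DataNode daemons on the two masters
752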

Change 658098 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add the Hadoop worker profile to master/standby in Backup

https://gerrit.wikimedia.org/r/658098

Change 658098 merged by Elukey:
[operations/puppet@production] Add the Hadoop worker profile to master/standby in Backup

https://gerrit.wikimedia.org/r/658098

Change 658215 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add the HDFS balancer to the Master node in Hadoop backup

https://gerrit.wikimedia.org/r/658215

Change 658215 merged by Elukey:
[operations/puppet@production] Add the HDFS balancer to the Master node in Hadoop backup

https://gerrit.wikimedia.org/r/658215
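
For reference, the balancer deployed by this change can also be run by hand; the threshold value below is just an example:

~ # sudo -u hdfs hdfs balancer -threshold 10   # move blocks until every DataNode is within 10% of the cluster average utilization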

Change 658219 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add a more restrictive default umask to Hadoop backup

https://gerrit.wikimedia.org/r/658219

Change 658219 merged by Elukey:
[operations/puppet@production] Add a more restrictive default umask to Hadoop backup

https://gerrit.wikimedia.org/r/658219
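
Once the change is applied, the effective default umask can be verified from any cluster node (the property name below is the standard Hadoop key):

~ # hdfs getconf -confKey fs.permissions.umask-mode   # cluster-wide default umask for new HDFS files/directories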

Change 658394 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add more users to the Hadoop Backup cluster (no ssh access)

https://gerrit.wikimedia.org/r/658394

Change 658394 merged by Elukey:
[operations/puppet@production] Add more users to the Hadoop Backup cluster (no ssh access)

https://gerrit.wikimedia.org/r/658394

Change 658553 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-worker1119 and 1131 to the Hadoop backup cluster

https://gerrit.wikimedia.org/r/658553

Change 658553 merged by Elukey:
[operations/puppet@production] Add an-worker1119 and 1131 to the Hadoop backup cluster

https://gerrit.wikimedia.org/r/658553

The last two nodes have been added to the cluster, so in theory it is ready to go. I'll leave this task open for a bit in case anything else is needed :)
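
A couple of quick checks that the cluster is actually ready (to be run on one of the masters):

~ # sudo -u hdfs hdfs dfsadmin -report | head -n 10   # total/remaining capacity and number of live DataNodes
~ # sudo -u hdfs hdfs dfs -ls /                       # basic smoke test of the filesystem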

Change 661051 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission an-worker1117 from the Hadoop cluster

https://gerrit.wikimedia.org/r/661051

Change 661051 merged by Elukey:
[operations/puppet@production] Decommission an-worker1117 from the Hadoop cluster

https://gerrit.wikimedia.org/r/661051

elukey set the point value for this task to 13. (Feb 11 2021, 11:12 AM)
elukey added a project: Analytics-Kanban.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.