Create the new Hadoop test cluster
Closed, ResolvedPublic

Description

High level things to do:

  1. Help DC Ops rack, set up, and deploy the new hosts: two masters, one coordinator, and three workers. This will likely need a new partman config or similar (a quick spot-check of the result is sketched after this list).
  2. Come up with the new config in puppet for the test cluster. I have gathered some things to keep in mind when bootstrapping a testing cluster: https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop
  3. Once all is done, decommission the old test cluster and give the DC Ops team the green light to decommission analytics1028->40.
  4. Special consideration for analytics1041, which runs our Druid testing cluster. We'll need to find a solution, maybe co-location somewhere or, more likely, a VM. analytics1041 will need to be decommissioned as well.
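
Once DC Ops hands the hosts over, a quick sanity check of the partman-produced disk layout on one of the new workers could look like this (a sketch; the host name matches the list further down, and the single-/srv layout is per the notes later in this task):

    ssh an-test-worker1001.eqiad.wmnet 'lsblk; df -h /srv'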

See subtasks for more info :)

Details

Repo                 Branch       Lines +/-
operations/puppet    production   +0 -12
operations/puppet    production   +1 -1
operations/puppet    production   +13 -14
operations/puppet    production   +8 -2
operations/puppet    production   +1 -1
operations/puppet    production   +4 -0
operations/puppet    production   +6 -6
operations/puppet    production   +2 -0
operations/dns       master       +12 -0
operations/puppet    production   +16 -11
labs/private         master       +0 -0
operations/puppet    production   +0 -5
operations/puppet    production   +1 -0
operations/puppet    production   +3 -0
operations/puppet    production   +0 -7
operations/puppet    production   +1 -1
operations/puppet    production   +10 -10
operations/puppet    production   +0 -5
operations/puppet    production   +0 -5
operations/puppet    production   +13 -9

Event Timeline

elukey triaged this task as Medium priority. Aug 14 2020, 9:47 AM
elukey updated the task description.

The new hosts are ready to be used:

  • an-test-master100[1,2] will be the Hadoop master nodes
  • an-test-coord1001 will be the coordinator/launcher node
  • an-test-worker100[1-3] will be the Hadoop worker nodes

Some notes to think about:

  • The new Hadoop workers have a single RAID10 LVM volume, and the majority of the disk space is under /srv, so we'll have only one directory listed in puppet/hiera for the datanode dirs (a way to verify this is sketched right after this list). The current Hadoop test cluster is closer to the production one, where each node has 12 disks (without any RAID), each with its own partition listed in puppet.
  • analytics1041 and analytics1039 are special nodes, namely Druid test and Hue test. Since these hosts will need to be decommissioned, we'll probably need to create some small VMs for Druid/Hue/etc.
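
For the first bullet, a minimal way to confirm what the DataNode actually uses once puppet has applied the config (a sketch; the config key is standard Hadoop, the expected values are assumptions based on the notes above):

    # On a new an-test-worker host, expect a single /srv-based directory:
    hdfs getconf -confKey dfs.datanode.data.dir
    # On an old analytics10xx test worker, expect twelve comma-separated
    # directories, one per unraided disk.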

Change 633162 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set up the new Analytics Hadoop test cluster

https://gerrit.wikimedia.org/r/633162

Change 633162 merged by Elukey:
[operations/puppet@production] Set up the new Analytics Hadoop test cluster

https://gerrit.wikimedia.org/r/633162

Change 633164 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove min disk available constraint from Hadoop test workers' settings

https://gerrit.wikimedia.org/r/633164

Change 633164 merged by Elukey:
[operations/puppet@production] Remove min disk available constraint from Hadoop test workers' settings

https://gerrit.wikimedia.org/r/633164

Change 633168 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Avoid the analytics keytab for the Analytics Hadoop test master

https://gerrit.wikimedia.org/r/633168

Change 633168 merged by Elukey:
[operations/puppet@production] Avoid the analytics keytab for the Analytics Hadoop test master

https://gerrit.wikimedia.org/r/633168

Change 633187 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix typos for Hadoop test cluster's hostnames

https://gerrit.wikimedia.org/r/633187

Change 633187 merged by Elukey:
[operations/puppet@production] Fix typos for Hadoop test cluster's hostnames

https://gerrit.wikimedia.org/r/633187

Change 633386 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set the test coordinator role to an-test-coord1001

https://gerrit.wikimedia.org/r/633386

Change 633386 merged by Elukey:
[operations/puppet@production] Set the test coordinator role to an-test-coord1001

https://gerrit.wikimedia.org/r/633386

Change 633387 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::coordinator: avoid db backups

https://gerrit.wikimedia.org/r/633387

Change 633387 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::coordinator: avoid db backups

https://gerrit.wikimedia.org/r/633387

Change 633471 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Reduce the HDFS block replication factor for the Hadoop test cluster

https://gerrit.wikimedia.org/r/633471

Change 633471 merged by Elukey:
[operations/puppet@production] Reduce the HDFS block replication factor for the Hadoop test cluster

https://gerrit.wikimedia.org/r/633471
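
To double-check the new default after this change, something like the following should work from any node with the Hadoop client config (a sketch; the path in the second command is illustrative):

    hdfs getconf -confKey dfs.replication    # expect 2 on the test cluster
    # Files written before the change keep their old factor; they can be
    # rewritten at the new one with:
    hdfs dfs -setrep -w 2 /some/existing/path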

Change 633476 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::coordinator: set /srv/mysql as datadir for mariadb

https://gerrit.wikimedia.org/r/633476

Change 633476 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::coordinator: set /srv/mysql as datadir for mariadb

https://gerrit.wikimedia.org/r/633476

The basic test cluster is set up:

an-test-master100[1,2] - Hadoop masters
an-test-coord1001 - test coordinator
an-test-worker100[1-3] - workers

The total HDFS space is around 42T (not a lot), with an HDFS block replication factor of 2 (not 3 as in the main cluster); the capacity figure can be double-checked as sketched after the list below. A couple of things are missing:

  • an-tool1006 (test client) is running Stretch, while all our stat100x hosts are on Buster, so we should decommission that VM and create a new one with Buster.
  • analytics1041 was the test host for Druid, so we should probably create a VM for it.
  • Hue is missing, so another small VM is probably needed too.
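
As referenced above, the reported capacity can be verified with the standard HDFS admin report (a sketch; exact figures will differ):

    # Needs HDFS superuser credentials, hence the sudo:
    sudo -u hdfs hdfs dfsadmin -report | head -n 10
    # "Configured Capacity" should be around 42T across the three workers;
    # at replication factor 2 that holds roughly 21T of real data.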

@razzi @klausman I think the above is a good set of small tasks to get familiar with Hadoop and Ganeti (the infrastructure that we use to run VMs - https://wikitech.wikimedia.org/wiki/Ganeti). Let's schedule the work later on, and let me know if there is anything that you'd prefer to do.

Change 633944 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::hadoop::worker: remove unnecessary nrpe disk check

https://gerrit.wikimedia.org/r/633944

Change 633944 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::hadoop::worker: remove unnecessary nrpe disk check

https://gerrit.wikimedia.org/r/633944

Change 633945 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-test-coord1001's IPs to Kafka Jumbo's ferm rules

https://gerrit.wikimedia.org/r/633945

Change 633946 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add IPv6's PTR/AAAA records for the new Hadoop test cluster

https://gerrit.wikimedia.org/r/633946

Change 633947 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::worker: avoid custom disk space checks for the test cluster

https://gerrit.wikimedia.org/r/633947

Change 633948 had a related patch set uploaded (by Elukey; owner: Elukey):
[labs/private@master] Add fake keytabs for new Hadoop test cluster nodes

https://gerrit.wikimedia.org/r/633948

Change 633948 merged by Elukey:
[labs/private@master] Add fake keytabs for new Hadoop test cluster nodes

https://gerrit.wikimedia.org/r/633948

Change 633947 merged by Elukey:
[operations/puppet@production] profile::hadoop::worker: avoid custom disk space checks for the test cluster

https://gerrit.wikimedia.org/r/633947

Change 633946 merged by Elukey:
[operations/dns@master] Add IPv6's PTR/AAAA records for the new Hadoop test cluster

https://gerrit.wikimedia.org/r/633946
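
A quick way to verify the new records once the change is live (a sketch; an-test-coord1001 is just one example host):

    dig +short AAAA an-test-coord1001.eqiad.wmnet
    # PTR round-trip on the address returned above:
    dig +short -x "$(dig +short AAAA an-test-coord1001.eqiad.wmnet)"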

Change 633945 merged by Elukey:
[operations/puppet@production] Add an-test-coord1001's IPs to Kafka Jumbo's ferm rules

https://gerrit.wikimedia.org/r/633945

Change 633973 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::master: refactor monitor for HDFS space left

https://gerrit.wikimedia.org/r/633973

Change 633973 merged by Elukey:
[operations/puppet@production] profile::hadoop::master: refactor monitor for HDFS space left

https://gerrit.wikimedia.org/r/633973

Change 635563 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::client: upload hive-site.xml

https://gerrit.wikimedia.org/r/635563

Change 635563 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::client: upload hive-site.xml

https://gerrit.wikimedia.org/r/635563

Change 635662 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Add analytics_test_cluster::client role to an-test-client1001

https://gerrit.wikimedia.org/r/635662

Change 635662 merged by Razzi:
[operations/puppet@production] Add analytics_test_cluster::client role to an-test-client1001

https://gerrit.wikimedia.org/r/635662

Created all the webrequest tables from the hive scripts in refinery.
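
For the record, this boils down to feeding the refinery DDL files to beeline, roughly like below (a sketch, not the exact invocation; the JDBC URL, script path, and database name are placeholders/assumptions):

    beeline -u "$HIVE2_JDBC_URL" \
        --hivevar database=wmf_raw \
        -f /srv/deployment/analytics/refinery/hive/webrequest/create_webrequest_raw_table.hql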

Remaining steps:

  • Replace an-tool1006 with a Buster VM (already in progress)
  • Create a new VM for Druid
  • Create a new VM for Hue

Change 635959 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move upload of hive-site.xml from client to hadoop standby in test

https://gerrit.wikimedia.org/r/635959

Change 635959 merged by Elukey:
[operations/puppet@production] Move upload of hive-site.xml from client to hadoop standby in test

https://gerrit.wikimedia.org/r/635959

Change 637502 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Review settings for the new Druid test cluster

https://gerrit.wikimedia.org/r/637502

Change 637502 merged by Elukey:
[operations/puppet@production] Review settings for the new Druid test cluster

https://gerrit.wikimedia.org/r/637502

Change 637526 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add role::druid::test_analytics::worker to an-test-druid1001

https://gerrit.wikimedia.org/r/637526

Change 637526 merged by Elukey:
[operations/puppet@production] Add role::druid::test_analytics::worker to an-test-druid1001

https://gerrit.wikimedia.org/r/637526

Created https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Test

Bootstrapped Hue/YARN and Druid; I think we are now good!

The last step is to decommission an-tool1006 and then we are done.

Mentioned in SAL (#wikimedia-operations) [2020-10-30T08:54:11Z] <elukey> decom an-tool1006 (old analytics test vm) - T255139

Change 637638 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission an-tool1006

https://gerrit.wikimedia.org/r/637638

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: an-tool1006.eqiad.wmnet

  • an-tool1006.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above
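
For reference, the run above corresponds to an invocation along these lines on the cumin host (a sketch; the -t task flag follows the usual cookbook conventions, and the task id comes from the SAL entry above):

    sudo cookbook sre.hosts.decommission an-tool1006.eqiad.wmnet -t T255139
    # The COMMON_STEPS failure means the sre.dns.netbox cookbook still has
    # to be re-run separately to finish the DNS cleanup.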

Change 637638 merged by Elukey:
[operations/puppet@production] Decommission an-tool1006

https://gerrit.wikimedia.org/r/637638