
Repurpose notebook100[3,4]
Closed, Resolved, Public

Description

The notebook1003/4 hosts were decommissioned in T249752. The idea is to repurpose them as:

  • an-launcher1002 - replacement for an-launcher1001, with more RAM and cores to avoid bottlenecks when a lot of jobs are running (such as at the beginning of the month with sqoop).
  • an-airflow/scheduler - dedicated node for Airflow (or whichever similar scheduling tool we choose).

The tricky part is renaming the hostnames in DNS and Puppet, which needs to be done with extreme care to avoid trouble.
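One way to sanity-check a rename is to confirm that every forward (A/AAAA) record for the new hostname has a reverse (PTR) record pointing back at the same name. The following is only a minimal sketch using the Python standard library; `check_forward_reverse` is a hypothetical helper, not part of any Wikimedia tooling, and the resolver functions are injectable so the logic can be exercised without touching real DNS.

```python
import socket
from typing import Callable, List, Tuple


def forward_addresses(hostname: str,
                      getaddrinfo: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted, de-duplicated addresses the hostname resolves to."""
    infos = getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_forward_reverse(hostname: str,
                          getaddrinfo: Callable = socket.getaddrinfo,
                          gethostbyaddr: Callable = socket.gethostbyaddr
                          ) -> List[Tuple[str, str, bool]]:
    """For each address of hostname, report (address, PTR name, names match)."""
    wanted = hostname.rstrip(".").lower()
    results = []
    for addr in forward_addresses(hostname, getaddrinfo):
        try:
            ptr_name = gethostbyaddr(addr)[0]
        except OSError:
            # No reverse record (or lookup failure): report an empty PTR name.
            ptr_name = ""
        matches = ptr_name.rstrip(".").lower() == wanted
        results.append((addr, ptr_name, matches))
    return results
```

Calling `check_forward_reverse("an-launcher1002.eqiad.wmnet")` from a host that can resolve internal names would list each address with its PTR name and whether the two agree; a `False` entry suggests a record that still needs renaming or adding.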

Event Timeline

elukey triaged this task as High priority. Jun 25 2020, 11:30 AM
elukey created this task.

Change 607764 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set notebook100[3,4] with role::insetup

https://gerrit.wikimedia.org/r/607764

Change 607764 merged by Elukey:
[operations/puppet@production] Set notebook100[3,4] with role::insetup

https://gerrit.wikimedia.org/r/607764

Change 607771 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Clean up old reference to notebook100[3,4] and set PXE to Buster

https://gerrit.wikimedia.org/r/607771

Change 607771 merged by Elukey:
[operations/puppet@production] Clean up old reference to notebook100[3,4] and set PXE to Buster

https://gerrit.wikimedia.org/r/607771

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['notebook1003.eqiad.wmnet', 'notebook1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006251213_elukey_28913.log.

Completed auto-reimage of hosts:

['notebook1003.eqiad.wmnet', 'notebook1004.eqiad.wmnet']

and were ALL successful.

Change 607779 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove notebook1003 from puppet

https://gerrit.wikimedia.org/r/607779

Change 607779 merged by Elukey:
[operations/puppet@production] Remove notebook1003 from puppet

https://gerrit.wikimedia.org/r/607779

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: notebook1003.eqiad.wmnet

  • notebook1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 607780 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Rename notebook1003 records to an-launcher1002 records

https://gerrit.wikimedia.org/r/607780

Change 607780 merged by Elukey:
[operations/dns@master] Rename notebook1003 records to an-launcher1002 records

https://gerrit.wikimedia.org/r/607780

Mentioned in SAL (#wikimedia-operations) [2020-06-25T12:55:26Z] <elukey> rename notebook1003 to an-launcher1002 - T256363

Change 607781 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-launcher1002 to puppet config

https://gerrit.wikimedia.org/r/607781

Change 607781 merged by Elukey:
[operations/puppet@production] Add an-launcher1002 to puppet config

https://gerrit.wikimedia.org/r/607781

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-launcher1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006251305_elukey_12287.log.

Completed auto-reimage of hosts:

['an-launcher1002.eqiad.wmnet']

Of which those FAILED:

['an-launcher1002.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-launcher1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006251437_elukey_26775.log.

Change 607808 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add ipv6 AAAA/PTR records for an-launcher1002

https://gerrit.wikimedia.org/r/607808

Completed auto-reimage of hosts:

['an-launcher1002.eqiad.wmnet']

and were ALL successful.

Change 607808 merged by Elukey:
[operations/dns@master] Add ipv6 AAAA/PTR records for an-launcher1002

https://gerrit.wikimedia.org/r/607808

Change 607819 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move all analytics timers but RU ones from an-launcher1001 to 1002

https://gerrit.wikimedia.org/r/607819

Change 607819 merged by Elukey:
[operations/puppet@production] Move all analytics timers but RU ones from an-launcher1001 to 1002

https://gerrit.wikimedia.org/r/607819

Change 607839 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove hiera specific overrides for an-launcher1002

https://gerrit.wikimedia.org/r/607839

Change 607839 merged by Elukey:
[operations/puppet@production] Remove hiera specific overrides for an-launcher1002

https://gerrit.wikimedia.org/r/607839

The next step is to rename notebook1004 to something like an-airflow/an-scheduler.
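The order of operations used for notebook1003 in the timeline above can be summarized as a reusable checklist. This is a rough sketch only: `rename_plan` is a hypothetical helper, and the step descriptions paraphrase the timeline entries rather than naming actual commands.

```python
from typing import List


def rename_plan(old_host: str, new_host: str) -> List[str]:
    """Return the ordered rename steps, paraphrased from the task timeline."""
    return [
        f"Remove {old_host} from the puppet repository",
        f"Run the sre.hosts.decommission cookbook for {old_host}",
        f"Rename the {old_host} DNS records to {new_host}",
        f"Add {new_host} to the puppet configuration",
        f"Reimage the host as {new_host}",
    ]


if __name__ == "__main__":
    for number, step in enumerate(rename_plan("notebook1004", "an-scheduler1001"), 1):
        print(f"{number}. {step}")
```

Keeping the decommission step ahead of the DNS rename means the old name is cleaned out of monitoring and PuppetDB before the new name appears.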

Change 608258 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission an-launcher1001

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608258

Change 608258 merged by Elukey:
[operations/puppet@production] Decommission an-launcher1001

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608258

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: an-launcher1001.eqiad.wmnet

  • an-launcher1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed

Mentioned in SAL (#wikimedia-operations) [2020-06-29T06:50:15Z] <elukey> execute gnt-instance remove an-launcher1001.eqiad.wmnet on ganeti1011 - T256363

Change 608260 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Remove an-launcher1001's records

https://gerrit.wikimedia.org/r/c/operations/dns/+/608260

Change 608260 merged by Elukey:
[operations/dns@master] Remove an-launcher1001's records

https://gerrit.wikimedia.org/r/c/operations/dns/+/608260

Change 609387 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove notebook1004 from production

https://gerrit.wikimedia.org/r/609387

Change 609387 merged by Elukey:
[operations/puppet@production] Remove notebook1004 from production

https://gerrit.wikimedia.org/r/609387

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: notebook1004.eqiad.wmnet

  • notebook1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 609396 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Rename notebook1004 to an-scheduler1001

https://gerrit.wikimedia.org/r/609396

Change 609396 merged by Elukey:
[operations/dns@master] Rename notebook1004 to an-scheduler1001

https://gerrit.wikimedia.org/r/609396

Change 609398 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add basic setup for an-scheduler1001

https://gerrit.wikimedia.org/r/609398

Change 609398 merged by Elukey:
[operations/puppet@production] Add basic setup for an-scheduler1001

https://gerrit.wikimedia.org/r/609398

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-scheduler1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007030923_elukey_12891.log.

Completed auto-reimage of hosts:

['an-scheduler1001.eqiad.wmnet']

Of which those FAILED:

['an-scheduler1001.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-scheduler1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007030955_elukey_8086.log.

Completed auto-reimage of hosts:

['an-scheduler1001.eqiad.wmnet']

and were ALL successful.

All done! The new an-scheduler1001 currently has a generic Puppet role; we'll switch to something more specific when the time comes.

elukey moved this task from In Progress to Done on the Analytics-Kanban board.