Toolforge: Create a cookbook to decommission an SGE node
Open, Low, Public

Event Timeline

Current process for removing a node manually:

  1. depooled using exec-manage
  2. deleted the VM (including the usual dance for puppet node deactivate / puppet node clean)
  3. ran grid-configurator
root@tools-sgegrid-master:~# sudo grid-configurator --dry-run --all-domains
2022-05-30 13:16:11,361 INFO would delete file /data/project/.system_sge/store/execnode-tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, VM doesn't exists (dry run)
2022-05-30 13:16:11,362 INFO would delete file /data/project/.system_sge/store/hostkey-tools-sgeexec-0901.tools.eqiad.wmflabs, VM doesn't exists (dry run)
2022-05-30 13:16:11,362 INFO would delete file /data/project/.system_sge/store/execnode-tools-sgeexec-0901.tools.eqiad.wmflabs, VM doesn't exists (dry run)
2022-05-30 13:16:11,368 INFO would remove /data/project/.system_sge/gridengine/collectors/hostgroups/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM (dry run)
2022-05-30 13:16:11,371 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM (dry run)
2022-05-30 13:16:11,373 INFO would rm 'tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud' from 'hostlist' parameter at /data/project/.system_sge/gridengine/etc/hostgroups/@general (dry run)
2022-05-30 13:16:14,594 INFO Would delete exec host: qconf -de tools-sgeexec-0901.tools.eqiad.wmflabs
2022-05-30 13:16:14,761 INFO Would delete submit host: qconf -ds tools-sgeexec-0901.tools.eqiad.wmflabs
root@tools-sgegrid-master:~# sudo grid-configurator --all-domains
2022-05-30 13:17:18,697 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'unable to resolve host "tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud"
'
2022-05-30 13:17:18,701 INFO deleting file /data/project/.system_sge/store/execnode-tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, VM doesn't exists
2022-05-30 13:17:18,702 INFO deleting file /data/project/.system_sge/store/hostkey-tools-sgeexec-0901.tools.eqiad.wmflabs, VM doesn't exists
2022-05-30 13:17:18,704 INFO deleting file /data/project/.system_sge/store/execnode-tools-sgeexec-0901.tools.eqiad.wmflabs, VM doesn't exists
2022-05-30 13:17:18,710 INFO removing /data/project/.system_sge/gridengine/collectors/hostgroups/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM
2022-05-30 13:17:18,713 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM (dry run)
2022-05-30 13:17:18,717 INFO would rm 'tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud' from 'hostlist' parameter at /data/project/.system_sge/gridengine/etc/hostgroups/@general (dry run)
2022-05-30 13:17:22,027 WARNING command 'qconf -de tools-sgeexec-0901.tools.eqiad.wmflabs' generated stderr: 'Host object "tools-sgeexec-0901.tools.eqiad.wmflabs" is still referenced in cluster queue "continuous".
'
2022-05-30 13:17:22,349 WARNING command 'qconf -ds tools-sgeexec-0901.tools.eqiad.wmflabs' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed "tools-sgeexec-0901.tools.eqiad.wmflabs" from submit host list
'
2022-05-30 13:17:23,564 WARNING command 'qconf -Mq /data/project/.system_sge/gridengine/etc/queues/webgrid-lighttpd' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "webgrid-lighttpd" in cluster queue list
'
2022-05-30 13:17:24,428 WARNING command 'qconf -Mq /data/project/.system_sge/gridengine/etc/queues/webgrid-generic' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "webgrid-generic" in cluster queue list
  4. removed the host from the @general host group
root@tools-sgegrid-master:~# sudo qconf -mhgrp @general
root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list
  5. ran grid-configurator again
root@tools-sgegrid-master:~# sudo grid-configurator --all-domains 
2022-05-30 13:20:17,721 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list
'
2022-05-30 13:20:17,736 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM (dry run)
2022-05-30 13:20:21,114 WARNING command 'qconf -de tools-sgeexec-0901.tools.eqiad.wmflabs' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed "tools-sgeexec-0901.tools.eqiad.wmflabs" from execution host list
'
2022-05-30 13:20:22,700 WARNING command 'qconf -Mq /data/project/.system_sge/gridengine/etc/queues/webgrid-lighttpd' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "webgrid-lighttpd" in cluster queue list
'
2022-05-30 13:20:23,526 WARNING command 'qconf -Mq /data/project/.system_sge/gridengine/etc/queues/webgrid-generic' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "webgrid-generic" in cluster queue list
'
  6. removed the host directory by hand
root@tools-sgegrid-master:~# rm -r /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud
  7. removed the host from /data/project/.system_sge/gridengine/default/common/host_aliases
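The hand edit of the @general host group, and the fact that the node shows up under two domain suffixes in the logs, could be sketched roughly as below. This is only an illustration, not what grid-configurator actually does: `DOMAINS`, `node_fqdns` and `drop_from_hostlist` are hypothetical names, the two suffixes are simply the ones visible in the output above, and the host group text is assumed to keep `hostlist` on a single line (no backslash continuations), with `NONE` standing for an empty list.

```python
from typing import Iterable, List

# Assumption: the two domain suffixes seen in the logs in this task.
DOMAINS = ("eqiad1.wikimedia.cloud", "eqiad.wmflabs")


def node_fqdns(
    hostname: str, project: str = "tools", domains: Iterable[str] = DOMAINS
) -> List[str]:
    """Return every FQDN the node may be registered under (hypothetical helper)."""
    return [f"{hostname}.{project}.{domain}" for domain in domains]


def drop_from_hostlist(hostgroup_text: str, fqdns: Iterable[str]) -> str:
    """Remove the given hosts from the 'hostlist' line of a host group definition.

    Assumes a single-line 'hostlist' entry; other lines are passed through.
    """
    unwanted = set(fqdns)
    out = []
    for line in hostgroup_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "hostlist":
            kept = [host for host in value.split() if host not in unwanted]
            # Assumption: an empty host list is written as NONE.
            line = "hostlist " + (" ".join(kept) if kept else "NONE")
        out.append(line)
    return "\n".join(out) + "\n"
```

For example, `drop_from_hostlist(text, node_fqdns("tools-sgeexec-0901"))` returns the same host group text without either spelling of the node. Feeding the result back through `qconf -Mhgrp` would still be a separate step.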

Change 801385 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] sonofgridengine: grid_configurator: make the grid master a submit host

https://gerrit.wikimedia.org/r/801385

Change 801770 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] sonofgridengine: grid_configurator: filter 'normal' stderr output

https://gerrit.wikimedia.org/r/801770

Change 801385 abandoned by Majavah:

[operations/puppet@production] sonofgridengine: grid_configurator: make the grid master a submit host

Reason:

https://gerrit.wikimedia.org/r/801385

Change 801774 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] sonofgridengine: grid_configurator: remove hostgroup and queue entries

https://gerrit.wikimedia.org/r/801774

Mentioned in SAL (#wikimedia-cloud) [2022-05-31T16:51:48Z] <taavi> delete tools-sgeexec-0904 for T309525 experimentation

Change 801777 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] sonofgridengine: grid_configurator: remove hosts entries

https://gerrit.wikimedia.org/r/801777

Update on the grid-configurator behaviour:

root@tools-sgegrid-master:~# sudo grid-configurator --all-domains --dry-run
2022-05-31 16:52:43,885 INFO would delete file /data/project/.system_sge/store/execnode-tools-sgeexec-0904.tools.eqiad.wmflabs, VM doesn't exists (dry run)
2022-05-31 16:52:43,886 INFO would delete file /data/project/.system_sge/store/hostkey-tools-sgeexec-0904.tools.eqiad.wmflabs, VM doesn't exists (dry run)
2022-05-31 16:52:43,889 INFO would delete file /data/project/.system_sge/store/execnode-tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, VM doesn't exists (dry run)
2022-05-31 16:52:43,893 INFO would remove /data/project/.system_sge/gridengine/collectors/hostgroups/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)
2022-05-31 16:52:43,897 INFO would rm 'tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud' from 'hostlist' parameter at /data/project/.system_sge/gridengine/etc/hostgroups/@general (dry run)
2022-05-31 16:52:43,909 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)
2022-05-31 16:52:47,087 INFO Would delete exec host: qconf -de tools-sgeexec-0904.tools.eqiad.wmflabs
2022-05-31 16:52:47,269 INFO Would delete submit host: qconf -ds tools-sgeexec-0904.tools.eqiad.wmflabs

root@tools-sgegrid-master:~# sudo grid-configurator --all-domains 
2022-05-31 16:53:30,992 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list'
2022-05-31 16:53:30,995 INFO deleting file /data/project/.system_sge/store/execnode-tools-sgeexec-0904.tools.eqiad.wmflabs, VM doesn't exists
2022-05-31 16:53:30,996 INFO deleting file /data/project/.system_sge/store/hostkey-tools-sgeexec-0904.tools.eqiad.wmflabs, VM doesn't exists
2022-05-31 16:53:31,000 INFO deleting file /data/project/.system_sge/store/execnode-tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, VM doesn't exists
2022-05-31 16:53:31,004 INFO removing /data/project/.system_sge/gridengine/collectors/hostgroups/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM
2022-05-31 16:53:31,012 INFO removing mention to 'tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud' from 'hostlist' parameter at /data/project/.system_sge/gridengine/etc/hostgroups/@general
2022-05-31 16:53:31,021 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)
2022-05-31 16:53:34,370 WARNING command 'qconf -de tools-sgeexec-0904.tools.eqiad.wmflabs' generated stderr: 'Host object "tools-sgeexec-0904.tools.eqiad.wmflabs" is still referenced in cluster queue "task".'
2022-05-31 16:53:34,721 WARNING command 'qconf -ds tools-sgeexec-0904.tools.eqiad.wmflabs' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed "tools-sgeexec-0904.tools.eqiad.wmflabs" from submit host list'

root@tools-sgegrid-master:~# sudo grid-configurator --all-domains 
2022-05-31 16:53:47,631 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list'
2022-05-31 16:53:47,652 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)
2022-05-31 16:53:50,935 WARNING command 'qconf -de tools-sgeexec-0904.tools.eqiad.wmflabs' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed "tools-sgeexec-0904.tools.eqiad.wmflabs" from execution host list'

root@tools-sgegrid-master:~# sudo grid-configurator --all-domains 
2022-05-31 16:54:08,341 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list'
2022-05-31 16:54:08,351 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)

Not quite sure why the exec host removal (qconf -de) failed on the first run, since the host group is modified before the exec host is removed.
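The runs above log plain qconf confirmations at WARNING level alongside the one real failure, which names the cluster queue that still references the host. A minimal sketch of telling the two apart is below. The patterns are derived only from the stderr lines quoted in this task, and `should_warn` and `blocking_queue` are hypothetical helpers, not the code in the grid_configurator patches.

```python
import re
from typing import Optional

# Assumption: confirmation messages look like the ones quoted in this task, e.g.
#   root@<master> modified "@general" in host group list
#   root@<master> removed "<host>" from submit host list
NORMAL_STDERR = re.compile(
    r'^\S+@\S+ (?:modified|removed) "[^"]+" (?:in|from) .+ list$'
)

# Assumption: the failure message format seen for 'qconf -de' above.
STILL_REFERENCED = re.compile(
    r'Host object "[^"]+" is still referenced in cluster queue "([^"]+)"'
)


def should_warn(stderr: str) -> bool:
    """True if any non-empty stderr line is not a known confirmation message."""
    lines = [line.strip() for line in stderr.splitlines() if line.strip()]
    return any(NORMAL_STDERR.match(line) is None for line in lines)


def blocking_queue(stderr: str) -> Optional[str]:
    """Return the cluster queue that still references the host, if stderr names one."""
    match = STILL_REFERENCED.search(stderr)
    return match.group(1) if match else None
```

With the first-run output for tools-sgeexec-0904, `blocking_queue` would return `task`, which suggests that the queue definition, and not only the @general host group, has to stop referencing the host before `qconf -de` can succeed.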

Change 801785 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/cookbooks@wmcs] wmcs: toolforge: add a cookbook to remove a grid node

https://gerrit.wikimedia.org/r/801785

Change 801770 merged by David Caro:

[operations/puppet@production] sonofgridengine: grid_configurator: filter 'normal' stderr output

https://gerrit.wikimedia.org/r/801770

Change 801774 merged by David Caro:

[operations/puppet@production] sonofgridengine: grid_configurator: remove hostgroup and queue entries

https://gerrit.wikimedia.org/r/801774

Change 801777 merged by David Caro:

[operations/puppet@production] sonofgridengine: grid_configurator: remove hosts entries

https://gerrit.wikimedia.org/r/801777

taavi triaged this task as Low priority.

Change 801785 merged by jenkins-bot:

[operations/cookbooks@wmcs] wmcs: toolforge: add a cookbook to remove a grid node

https://gerrit.wikimedia.org/r/801785