Event Timeline
current process for removing a node manually:
- depooled using exec-manage
- deleted the VM (including the usual dance for puppet node deactivate / puppet node clean)
- ran grid-configurator
root@tools-sgegrid-master:~# sudo grid-configurator --dry-run --all-domains
2022-05-30 13:16:11,361 INFO would delete file /data/project/.system_sge/store/execnode-tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, VM doesn't exists (dry run)
2022-05-30 13:16:11,362 INFO would delete file /data/project/.system_sge/store/hostkey-tools-sgeexec-0901.tools.eqiad.wmflabs, VM doesn't exists (dry run)
2022-05-30 13:16:11,362 INFO would delete file /data/project/.system_sge/store/execnode-tools-sgeexec-0901.tools.eqiad.wmflabs, VM doesn't exists (dry run)
2022-05-30 13:16:11,368 INFO would remove /data/project/.system_sge/gridengine/collectors/hostgroups/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM (dry run)
2022-05-30 13:16:11,371 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM (dry run)
2022-05-30 13:16:11,373 INFO would rm 'tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud' from 'hostlist' parameter at /data/project/.system_sge/gridengine/etc/hostgroups/@general (dry run)
2022-05-30 13:16:14,594 INFO Would delete exec host: qconf -de tools-sgeexec-0901.tools.eqiad.wmflabs
2022-05-30 13:16:14,761 INFO Would delete submit host: qconf -ds tools-sgeexec-0901.tools.eqiad.wmflabs
root@tools-sgegrid-master:~# sudo grid-configurator --all-domains
2022-05-30 13:17:18,697 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'unable to resolve host "tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud" '
2022-05-30 13:17:18,701 INFO deleting file /data/project/.system_sge/store/execnode-tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, VM doesn't exists
2022-05-30 13:17:18,702 INFO deleting file /data/project/.system_sge/store/hostkey-tools-sgeexec-0901.tools.eqiad.wmflabs, VM doesn't exists
2022-05-30 13:17:18,704 INFO deleting file /data/project/.system_sge/store/execnode-tools-sgeexec-0901.tools.eqiad.wmflabs, VM doesn't exists
2022-05-30 13:17:18,710 INFO removing /data/project/.system_sge/gridengine/collectors/hostgroups/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM
2022-05-30 13:17:18,713 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM (dry run)
2022-05-30 13:17:18,717 INFO would rm 'tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud' from 'hostlist' parameter at /data/project/.system_sge/gridengine/etc/hostgroups/@general (dry run)
2022-05-30 13:17:22,027 WARNING command 'qconf -de tools-sgeexec-0901.tools.eqiad.wmflabs' generated stderr: 'Host object "tools-sgeexec-0901.tools.eqiad.wmflabs" is still referenced in cluster queue "continuous". '
2022-05-30 13:17:22,349 WARNING command 'qconf -ds tools-sgeexec-0901.tools.eqiad.wmflabs' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed "tools-sgeexec-0901.tools.eqiad.wmflabs" from submit host list '
2022-05-30 13:17:23,564 WARNING command 'qconf -Mq /data/project/.system_sge/gridengine/etc/queues/webgrid-lighttpd' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "webgrid-lighttpd" in cluster queue list '
2022-05-30 13:17:24,428 WARNING command 'qconf -Mq /data/project/.system_sge/gridengine/etc/queues/webgrid-generic' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "webgrid-generic" in cluster queue list
- removed from general host list
root@tools-sgegrid-master:~# sudo qconf -mhgrp @general
root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list
- ran grid-configurator again
root@tools-sgegrid-master:~# sudo grid-configurator --all-domains
2022-05-30 13:20:17,721 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list '
2022-05-30 13:20:17,736 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0901' is not a VM (dry run)
2022-05-30 13:20:21,114 WARNING command 'qconf -de tools-sgeexec-0901.tools.eqiad.wmflabs' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed "tools-sgeexec-0901.tools.eqiad.wmflabs" from execution host list '
2022-05-30 13:20:22,700 WARNING command 'qconf -Mq /data/project/.system_sge/gridengine/etc/queues/webgrid-lighttpd' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "webgrid-lighttpd" in cluster queue list '
2022-05-30 13:20:23,526 WARNING command 'qconf -Mq /data/project/.system_sge/gridengine/etc/queues/webgrid-generic' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "webgrid-generic" in cluster queue list '
- removed the host directory by hand
root@tools-sgegrid-master:~# rm -r /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud
- removed from /data/project/.system_sge/gridengine/default/common/host_aliases
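The leftovers touched in the steps above follow a regular naming pattern, so they can be enumerated for a given host. The following is a minimal sketch, not part of grid-configurator: the function names are hypothetical, and the path layout and the two domain suffixes are taken only from the log output in this task. It generates every path/domain combination, so some candidates (for example a hostkey file under the newer domain) may never exist on disk.

```python
from typing import List

# Both domain suffixes appear in the grid-configurator logs in this task.
DOMAINS = ("tools.eqiad1.wikimedia.cloud", "tools.eqiad.wmflabs")
# Base directory as seen in the logs; treat it as an assumption elsewhere.
SGE_ROOT = "/data/project/.system_sge"


def leftover_candidates(hostname: str) -> List[str]:
    """Return paths that may still exist after the VM has been deleted."""
    paths = []
    for domain in DOMAINS:
        fqdn = f"{hostname}.{domain}"
        paths.extend([
            f"{SGE_ROOT}/store/execnode-{fqdn}",
            f"{SGE_ROOT}/store/hostkey-{fqdn}",
            f"{SGE_ROOT}/gridengine/collectors/hostgroups/{fqdn}",
            f"{SGE_ROOT}/gridengine/etc/hosts/{fqdn}",
        ])
    return paths


def qconf_removals(hostname: str) -> List[List[str]]:
    """Return the qconf invocations seen in the logs, one pair per domain."""
    commands = []
    for domain in DOMAINS:
        fqdn = f"{hostname}.{domain}"
        commands.append(["qconf", "-de", fqdn])  # delete exec host
        commands.append(["qconf", "-ds", fqdn])  # delete submit host
    return commands


if __name__ == "__main__":
    # Only prints; nothing is deleted or executed.
    for path in leftover_candidates("tools-sgeexec-0901"):
        print("check:", path)
    for command in qconf_removals("tools-sgeexec-0901"):
        print("would run:", " ".join(command))
```

The sketch deliberately only prints what to look for, which mirrors the --dry-run pass used above before making any change.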
Change 801385 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] sonofgridengine: grid_configurator: make the grid master a submit host
Change 801770 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] sonofgridengine: grid_configurator: filter 'normal' stderr output
Change 801385 abandoned by Majavah:
[operations/puppet@production] sonofgridengine: grid_configurator: make the grid master a submit host
Reason:
Change 801774 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] sonofgridengine: grid_configurator: remove hostgroup and queue entries
Mentioned in SAL (#wikimedia-cloud) [2022-05-31T16:51:48Z] <taavi> delete tools-sgeexec-0904 for T309525 experimentation
Change 801777 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] sonofgridengine: grid_configurator: remove hosts entries
update on the grid-configurator behaviour:
root@tools-sgegrid-master:~# sudo grid-configurator --all-domains --dry-run
2022-05-31 16:52:43,885 INFO would delete file /data/project/.system_sge/store/execnode-tools-sgeexec-0904.tools.eqiad.wmflabs, VM doesn't exists (dry run)
2022-05-31 16:52:43,886 INFO would delete file /data/project/.system_sge/store/hostkey-tools-sgeexec-0904.tools.eqiad.wmflabs, VM doesn't exists (dry run)
2022-05-31 16:52:43,889 INFO would delete file /data/project/.system_sge/store/execnode-tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, VM doesn't exists (dry run)
2022-05-31 16:52:43,893 INFO would remove /data/project/.system_sge/gridengine/collectors/hostgroups/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)
2022-05-31 16:52:43,897 INFO would rm 'tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud' from 'hostlist' parameter at /data/project/.system_sge/gridengine/etc/hostgroups/@general (dry run)
2022-05-31 16:52:43,909 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)
2022-05-31 16:52:47,087 INFO Would delete exec host: qconf -de tools-sgeexec-0904.tools.eqiad.wmflabs
2022-05-31 16:52:47,269 INFO Would delete submit host: qconf -ds tools-sgeexec-0904.tools.eqiad.wmflabs
root@tools-sgegrid-master:~# sudo grid-configurator --all-domains
2022-05-31 16:53:30,992 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list'
2022-05-31 16:53:30,995 INFO deleting file /data/project/.system_sge/store/execnode-tools-sgeexec-0904.tools.eqiad.wmflabs, VM doesn't exists
2022-05-31 16:53:30,996 INFO deleting file /data/project/.system_sge/store/hostkey-tools-sgeexec-0904.tools.eqiad.wmflabs, VM doesn't exists
2022-05-31 16:53:31,000 INFO deleting file /data/project/.system_sge/store/execnode-tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, VM doesn't exists
2022-05-31 16:53:31,004 INFO removing /data/project/.system_sge/gridengine/collectors/hostgroups/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM
2022-05-31 16:53:31,012 INFO removing mention to 'tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud' from 'hostlist' parameter at /data/project/.system_sge/gridengine/etc/hostgroups/@general
2022-05-31 16:53:31,021 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)
2022-05-31 16:53:34,370 WARNING command 'qconf -de tools-sgeexec-0904.tools.eqiad.wmflabs' generated stderr: 'Host object "tools-sgeexec-0904.tools.eqiad.wmflabs" is still referenced in cluster queue "task".'
2022-05-31 16:53:34,721 WARNING command 'qconf -ds tools-sgeexec-0904.tools.eqiad.wmflabs' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed "tools-sgeexec-0904.tools.eqiad.wmflabs" from submit host list'
root@tools-sgegrid-master:~# sudo grid-configurator --all-domains
2022-05-31 16:53:47,631 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list'
2022-05-31 16:53:47,652 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)
2022-05-31 16:53:50,935 WARNING command 'qconf -de tools-sgeexec-0904.tools.eqiad.wmflabs' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed "tools-sgeexec-0904.tools.eqiad.wmflabs" from execution host list'
root@tools-sgegrid-master:~# sudo grid-configurator --all-domains
2022-05-31 16:54:08,341 WARNING command 'qconf -Mhgrp /data/project/.system_sge/gridengine/etc/hostgroups/@general' generated stderr: 'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud modified "@general" in host group list'
2022-05-31 16:54:08,351 INFO would remove /data/project/.system_sge/gridengine/etc/hosts/tools-sgeexec-0904.tools.eqiad1.wikimedia.cloud, 'tools-sgeexec-0904' is not a VM (dry run)
Not quite sure why the exec host removal (qconf -de) failed on the first run and only succeeded on the second, since the host group is modified before the exec host is removed.
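To make runs like the ones above easier to read, the qconf stderr text could be sorted into "fine", "blocked by a queue" and "real error". This is a rough sketch built only from the messages quoted in this task. It is not the logic of any of the grid_configurator patches mentioned here, and the function name and return values are hypothetical.

```python
import re
from typing import Optional, Tuple

# Message shapes copied from the qconf stderr quoted in this task.
_STILL_REFERENCED = re.compile(r'is still referenced in cluster queue "([^"]+)"')
_SUCCESS = re.compile(r'\b(modified|removed) ".*" (in|from) ')


def classify_qconf_stderr(stderr: str) -> Tuple[str, Optional[str]]:
    """Return (status, queue): 'blocked' carries the queue name, else None."""
    match = _STILL_REFERENCED.search(stderr)
    if match:
        # The host cannot be deleted until this queue stops referencing it.
        return ("blocked", match.group(1))
    if _SUCCESS.search(stderr):
        # qconf reports successful changes on stderr too.
        return ("ok", None)
    # Anything unrecognised is treated as an error, to stay on the safe side.
    return ("error", None)


if __name__ == "__main__":
    samples = [
        'Host object "tools-sgeexec-0904.tools.eqiad.wmflabs" is still '
        'referenced in cluster queue "task".',
        'root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed '
        '"tools-sgeexec-0904.tools.eqiad.wmflabs" from execution host list',
        'unable to resolve host "tools-sgeexec-0901.tools.eqiad1.wikimedia.cloud"',
    ]
    for sample in samples:
        print(classify_qconf_stderr(sample))
```

Reporting the blocking queue name ("task" in the run above, "continuous" in the earlier one) shows which queue definition still lists the host when qconf -de is attempted.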
Change 801785 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/cookbooks@wmcs] wmcs: toolforge: add a cookbook to remove a grid node
Change 801770 merged by David Caro:
[operations/puppet@production] sonofgridengine: grid_configurator: filter 'normal' stderr output
Change 801774 merged by David Caro:
[operations/puppet@production] sonofgridengine: grid_configurator: remove hostgroup and queue entries
Change 801777 merged by David Caro:
[operations/puppet@production] sonofgridengine: grid_configurator: remove hosts entries
Change 801785 merged by jenkins-bot:
[operations/cookbooks@wmcs] wmcs: toolforge: add a cookbook to remove a grid node