Migrate tools/toolsbeta grid to Debian Buster.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T275864 Toolforge: migrate to Debian Buster
Open | | aborrero | T277653 Toolforge: migrate grid to Debian Buster
Resolved | | aborrero | T277866 cloud-init: figure out how to change /etc/hosts from cloud-init/vendordata
Resolved | | aborrero | T278232 Toolforge: figure out how to work with the new domain in the grid
Open | | aborrero | T278748 Toolforge: introduce support for selecting grid queue release
Open | | None | T280037 Toolforge: set up monitoring tooling for stretch deprecation
Open | | None | T280252 Toolforge Buster bastion no longer tab completes become command
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2021-03-17T11:56:06Z] <arturo> created puppet prefix 'toolsbeta-buster-sgeexec' (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-03-17T12:00:24Z] <arturo> create VM toolsbeta-buster-sgeexec-01 (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-03-17T12:38:10Z] <arturo> created puppet prefix 'toolsbeta-buster-gridmaster' (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-03-17T12:39:55Z] <arturo> created VM toolsbeta-buster-gridmaster (T277653)
Just a thought: since the grid master and shadow are not exec nodes, they can probably be swapped with Buster hosts easily, but they have to keep the same name as the old ones (with the original nodes shut down) for the swap to be smooth. We don't need separate grids for Buster, just separate queues for exec nodes and separate bastions. All state should be on NFS, and the grid doesn't use certs; it just uses DNS to identify the host.
In IRC, @aborrero pointed out that the VMs cannot have the same name (which makes me think that really, the "master" DNS name should be a service address with aliases for swapping out the VMs, if we were keeping the grid for the long term). So:
The shadow is a warm standby, so it can be replaced at any time (with no take-backs)... as long as it works in toolsbeta. It takes the grid 10 minutes or so to actually fail over, but you can fail over the master manually, destroy the master VM while the shadow handles requests, then replace it and fail back.
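For reference, a quick way to check which host the grid currently considers the acting master during such a swap. This is a standard gridengine file; the path below assumes Debian's default SGE root and the "default" cell, not a Toolforge-specific location:

```
# Show which host is currently acting as qmaster.
# Path assumes SGE_ROOT=/var/lib/gridengine and the "default" cell;
# adjust if the cell directory lives elsewhere (e.g. on NFS).
cat /var/lib/gridengine/default/common/act_qmaster
```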
On the exec queues:
The release value is still defined on the host templates in puppet via `complex_values slots=<%= 32 * @processorcount %>,release=<%= @lsbdistcodename %>`, so it should work. The thing to "undo" is our deprecation of that option in jsub, jstart, maybe crontab?, and webservice.
```
bstorm@tools-sgegrid-master:~$ qconf -se tools-sgeexec-0939.tools.eqiad.wmflabs
hostname              tools-sgeexec-0939.tools.eqiad.wmflabs
load_scaling          NONE
complex_values        h_vmem=24G,slots=64,release=stretch
load_values           arch=lx-amd64,num_proc=4,mem_total=7978.445312M, \
                      swap_total=24474.992188M,virtual_total=32453.437500M, \
                      m_topology=SCSCSCSC,m_socket=4,m_core=4,m_thread=4, \
                      load_avg=4.400000,load_short=4.470000, \
                      load_medium=4.400000,load_long=4.400000, \
                      mem_free=7297.328125M,swap_free=24474.992188M, \
                      virtual_free=31772.320312M,mem_used=681.117188M, \
                      swap_used=0.000000M,virtual_used=681.117188M, \
                      cpu=98.100000,m_topology_inuse=SCSCSCSC, \
                      np_load_avg=1.100000,np_load_short=1.117500, \
                      np_load_medium=1.100000,np_load_long=1.100000
processors            4
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
```
The arg to qsub looks like what's in https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Specifying_an_operating_system_release
It used to be a required arg for webservice. Honestly, we can just specify a default instead (and change the default in time).
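To make the mechanism above concrete, here is a minimal sketch of selecting a release through the grid's release complex. The qsub flags are standard gridengine; the job name and command are placeholders, and re-exposing this through jsub/webservice would still need the deprecation undone as noted above:

```
# Request an exec host that advertises release=buster (the complex defined
# in the host templates above); -b y submits a binary, -N names the job.
qsub -l release=buster -b y -N release-test /bin/true

# Keeping jobs on stretch hosts during the migration would be the same idea:
# qsub -l release=stretch ...
```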
Mentioned in SAL (#wikimedia-cloud) [2021-03-18T12:47:32Z] <arturo> delete puppet prefix toolsbeta-buster-grirdmaster (no longer useful) T277653
Mentioned in SAL (#wikimedia-cloud) [2021-03-18T12:48:13Z] <arturo> destroy VM toolsbeta-buster-gridmaster (no longer useful) T277653
Mentioned in SAL (#wikimedia-cloud) [2021-03-18T12:51:30Z] <arturo> rebuild toolsbeta-sgegrid-shadow instance as debian buster (T277653)
Change 673267 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: grid: base: stop using LVM
Change 673267 abandoned by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: grid: base: stop using LVM
Reason:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456
Mentioned in SAL (#wikimedia-cloud) [2021-03-18T18:44:33Z] <arturo> replacing toolsbeta-sgegrid-master with a Debian Buster VM (T277653)
Deleting the old grid master VM also deletes the DNS record toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs, which we apparently use somewhere.
```
toolsbeta.test@toolsbeta-bastion-05:~$ qstat
error: unable to send message to qmaster using port 6444 on host "toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs": can't resolve host name
```
It's in the project hiera:
```
commit d240fcf79853a9cd788759e375aade2557a82479 (HEAD -> master, origin/master, origin/HEAD)
Author: aborrero <aborrero@wikimedia.org>
Date:   Thu Mar 18 18:57:35 2021 +0000

    Horizon auto commit for user aborrero

diff --git a/toolsbeta/_.yaml b/toolsbeta/_.yaml
index fb8039f..7a4e05c 100644
--- a/toolsbeta/_.yaml
+++ b/toolsbeta/_.yaml
@@ -105,7 +105,7 @@ role::aptly::client::servername: tools-sge-services-03.tools.eqiad.wmflabs
 role::labs::nfsclient::lookupcache: all
 role::puppetmaster::standalone::autosign: true
 role::toollabs::k8s::master::use_puppet_certs: true
-sonofgridengine::gridmaster: toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs
+sonofgridengine::gridmaster: toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud
 ssh::server::disable_nist_kex: false
 ssh::server::enable_hba: true
 ssh::server::explicit_macs: false
```
Ooooohhhhhhhh, the wmflabs thing. I hope that works. It might...do something bad, but if it does, the fix is to create an alias or add the new name directly to the admin hosts.
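A rough sketch of the two workarounds mentioned above, using standard gridengine admin commands. The host_aliases path assumes Debian's default SGE root, and the old/new name pairing is illustrative:

```
# Option 1: register the new FQDN directly as an administrative host.
sudo qconf -ah toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud

# Option 2: tell the grid that old and new names refer to the same host via
# the cell's host_aliases file (the first name on a line is the canonical one).
echo "toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs" \
  | sudo tee -a /var/lib/gridengine/default/common/host_aliases
```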
The new master shows interesting things:
```
aborrero@toolsbeta-sgegrid-master:~ $ sudo qstat
/usr/share/gridengine/util/arch: 203: /usr/share/gridengine/util/arch: /usr/bin/cpp: not found
/var/lib/gridengine/util/arch: 203: /var/lib/gridengine/util/arch: /usr/bin/cpp: not found
error: commlib error: access denied (client IP resolved to host name "localhost". This is not identical to clients host name "toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud")
error: unable to send message to qmaster using port 6444 on host "toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud": got send error
```
Also, the bastion fails to interact with it:
```
toolsbeta.test@toolsbeta-bastion-05:~$ qstat
error: denied: host "toolsbeta-bastion-05.toolsbeta.eqiad1.wikimedia.cloud" is neither submit nor admin host
```
So this is not as simple as just replacing the VMs.
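For completeness, the manual fix for the bastion error above would be to register the bastion's new FQDN as a submit host. In Toolforge this registration is normally driven from the NFS-hosted host lists by grid-configurator (discussed further down), so this is only the underlying gridengine command:

```
# Run on the grid master: allow the bastion's new FQDN to submit jobs.
sudo qconf -as toolsbeta-bastion-05.toolsbeta.eqiad1.wikimedia.cloud
```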
Also, I'm a bit confused about this 127.0.1.1 entry in /etc/hosts:
```
127.0.1.1 toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud toolsbeta-sgegrid-master
127.0.0.1 localhost
```
Yuck. It *should* have been that easy, but I'm now wondering whether something changed in the Buster image that is incompatible, AND whether the change of FQDN might be fatal for the whole cluster unless we make sure the new name is already listed among the admin hosts or in the aliases.
This is the tools grid master's /etc/hosts:
```
bstorm@tools-sgegrid-master:~$ cat /etc/hosts
# HEADER: This file was autogenerated at 2019-12-04 20:47:27 +0000
# HEADER: by puppet. While it can still be managed manually, it
# HEADER: is definitely not recommended.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.16.4.197 tools-sgegrid-master.tools.eqiad.wmflabs tools-sgegrid-master
10.64.16.149 statsd.eqiad.wmnet statsd
```
I was only able to make it all work right by changing the 127.0.0.1 line in /etc/hosts on your server to:
```
127.0.0.1 localhost toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud
```
and restarting gridengine:
```
bstorm@toolsbeta-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
bstorm@toolsbeta-sgegrid-master:~$ qstat -f
/usr/share/gridengine/util/arch: 203: /usr/share/gridengine/util/arch: /usr/bin/cpp: not found
/var/lib/gridengine/util/arch: 203: /var/lib/gridengine/util/arch: /usr/bin/cpp: not found
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
task@toolsbeta-sgeexec-0901.to BI    0/0/50         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
continuous@toolsbeta-sgeexec-0 BC    0/0/50         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
webgrid-generic@toolsbeta-sgew B     0/0/256        -NA-     lx-amd64      au
---------------------------------------------------------------------------------
webgrid-lighttpd@toolsbeta-sge B     0/0/256        -NA-     lx-amd64      au
```
I put that back (so it's broken again), but hopefully that helps. The aliases seem to keep the grid from refusing to recognize the host, once the master itself knows its own name.
That clearly doesn't match the Stretch behavior (and the cpp errors are still there).
Change 673448 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] sonofgridengine: master: ensure cpp package is installed
Change 673448 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] sonofgridengine: master: ensure cpp package is installed
Mentioned in SAL (#wikimedia-cloud) [2021-03-23T11:07:21Z] <arturo> drop and build again the VM toolsbeta-sgregrid-shadow (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-03-23T11:22:04Z] <arturo> drop and build again the VM toolsbeta-sgregrid-master (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-03-23T12:15:58Z] <arturo> delete & re-create VM tools-sgegrid-shadow as Debian Buster (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-03-25T16:05:26Z] <bstorm> failed over the tools grid to the shadow master T277653
Mentioned in SAL (#wikimedia-cloud) [2021-03-25T16:20:20Z] <arturo> rebuilding tools-sgegrid-master VM as debian buster (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-03-25T17:46:11Z] <arturo> rebooting tools-sgeexec-* nodes to account for new grid master (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-03-25T19:30:39Z] <bstorm> forced deletion of all jobs stuck in a deleting state T277653
Change 677873 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] sonofgridengine: grid-configurator: introduce support for the new domain
I took a look at the patches. One thing you are running into is that the source of truth for that script is NFS-hosted files generated by puppet runs. There are files under /data/project/.system_sge/gridengine/etc/ that are generated by puppet on each instance (for instance by modules/profile/manifests/toolforge/grid/node/compute.pp). Whatever puppet believes the host FQDN is, that is what gets left there; that's why deleting a node requires manually removing those files. grid-configurator expects that NFS dir to be accurate. That could be forced to be the hostname with $project.eqiad1.wikimedia.cloud instead. The files only actually do anything when grid-configurator runs, so it would be "safe" to change their names and config. The collectors from ::sonofgridengine build some of the queues (the web ones, I believe) as well, so you might need to be a bit careful there to find all the places where the FQDN is used directly in puppet. Mistakes should be glossed over by the aliases file, though.
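As an illustration of the cleanup this implies, assuming (as described above) one record per host under those NFS directories that simply needs deleting before grid-configurator runs again. The filenames below are examples using the old FQDNs:

```
# Inspect what puppet has dropped into the grid-configurator source dirs.
ls /data/project/.system_sge/gridengine/etc/submithosts/

# Remove a record left behind by a host that was deleted or renamed,
# e.g. an entry still carrying the old eqiad.wmflabs FQDN.
sudo rm /data/project/.system_sge/gridengine/etc/submithosts/toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs
```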
Mentioned in SAL (#wikimedia-cloud) [2021-04-08T18:25:16Z] <bstorm> cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for tools-sgegrid-master and tools-sgegrid-shadow using the old fqdns T277653
Mentioned in SAL (#wikimedia-cloud) [2021-04-08T18:27:41Z] <bstorm> cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for toolsbeta-sgegrid-master and toolsbeta-sgegrid-shadow using the old fqdns T277653
Change 678043 had a related patch set uploaded (by Bstorm; author: Bstorm):
[operations/puppet@production] gridengine: set grid-configurator source files to use new domain name
Mentioned in SAL (#wikimedia-cloud) [2021-04-13T15:31:34Z] <arturo> live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/678043/ (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-04-13T15:36:47Z] <arturo> created VM toolsbeta-sgeexec-0903 (buster) (T277653)
Mentioned in SAL (#wikimedia-cloud) [2021-04-13T16:41:51Z] <arturo> create VM toolsbeta-sgeexec-1002 (T277653)
Change 678043 merged by Bstorm:
[operations/puppet@production] gridengine: set grid-configurator source files to use new domain name
Change 680038 had a related patch set uploaded (by Bstorm; author: Bstorm):
[operations/puppet@production] gridengine: set additional grid-configurator source files to new domain
Change 680038 merged by Bstorm:
[operations/puppet@production] gridengine: set additional grid-configurator source files to new domain
@aborrero just FYI, I noticed that diamond collectors don't run correctly on the new exec node in toolsbeta (0902). Every puppet run shows:
```
Notice: /Stage[main]/Diamond/Service[diamond]/ensure: ensure changed 'stopped' to 'running' (corrective)
```
I cleaned up the source dirs in toolsbeta, so when your patch is ready, you can test away (I didn't remove the workarounds for the legacy domain in there yet). I did *not* check carefully whether any of the host records belong to systems that have since been deleted. If anything was created and then deleted, its leftover records will cause failures if they are not cleaned up in the source dirs. I only removed things with the eqiad.wmflabs domain name.
If there's anything missing like that, the places to delete it are in /data/project/.system_sge/gridengine/etc/ AND /data/project/.system_sge/gridengine/collectors/. That last part has sent us on wild goose chases before :)
This is the traceback:
```
Apr 16 10:16:21 toolsbeta-sgeexec-0902 systemd[1]: Started diamond - A system statistics collector for graphite.
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: Changed UID: 0 () GID: 0 ().
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: Process SyncManager-1:
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: Traceback (most recent call last):
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self.run()
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self._target(*self._args, **self._kwargs)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/managers.py", line 550, in _run_server
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     server = cls._Server(registry, address, authkey, serializer)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/managers.py", line 162, in __init__
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self.listener = Listener(address=address, backlog=16)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/connection.py", line 132, in __init__
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self._listener = SocketListener(address, family, backlog)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/connection.py", line 256, in __init__
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self._socket.bind(address)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/socket.py", line 228, in meth
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     return getattr(self._sock,name)(*args)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: error: [Errno 1] Operation not permitted
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: Unhandled exception:
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: traceback: Traceback (most recent call last):
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/bin/diamond", line 281, in main
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     server = Server(configfile=options.configfile)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/dist-packages/diamond/server.py", line 59, in __init__
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self.manager = multiprocessing.Manager()
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/__init__.py", line 99, in Manager
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     m.start()
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/managers.py", line 528, in start
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self._address = reader.recv()
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: EOFError
```
I think this only happens on this particular Debian Stretch node. I can't find the issue on other Buster or Stretch servers, so I would just ignore it.
Change 677873 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] sonofgridengine: grid-configurator: introduce support for the new domain
OK, now that that is merged, I think we can move on to:
- create new Debian Buster nodes
- start figuring out what to do about queues and such (see the sketch after this list for checking which hosts advertise which release)
- push a bit harder on https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation
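A small sketch of how one might watch queue/host coverage while the new nodes come up, using standard gridengine host queries against the release complex defined earlier:

```
# Show the release resource advertised by each exec host.
qhost -F release

# List only the exec hosts that advertise release=buster.
qhost -l release=buster
```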
Mentioned in SAL (#wikimedia-cloud) [2021-04-16T23:15:32Z] <bstorm> cleaned up all source files for the grid with the old domain name to enable future node creation T277653