
Toolforge: migrate grid to Debian Buster
Open, MediumPublic

Description

Migrate tools/toolsbeta grid to Debian Buster.

Event Timeline

aborrero triaged this task as Medium priority.Mar 17 2021, 11:54 AM
aborrero created this task.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2021-03-17T11:56:06Z] <arturo> created puppet prefix 'toolsbeta-buster-sgeexec' (T277653)

Mentioned in SAL (#wikimedia-cloud) [2021-03-17T12:00:24Z] <arturo> create VM toolsbeta-buster-sgeexec-01 (T277653)

Mentioned in SAL (#wikimedia-cloud) [2021-03-17T12:38:10Z] <arturo> created puppet prefix 'toolsbeta-buster-gridmaster' (T277653)

Mentioned in SAL (#wikimedia-cloud) [2021-03-17T12:39:55Z] <arturo> created VM toolsbeta-buster-gridmaster (T277653)

Just a thought: since the grid master and shadow are not exec nodes, they can probably be swapped for Buster hosts easily, but they have to have the same names as the old ones (with the original nodes shut down) for the swap to be smooth. We don't need separate grids for Buster, just separate queues for exec nodes and separate bastions. All state should be on NFS, and the grid doesn't use certs; it just uses DNS to identify the host.

In IRC, @aborrero pointed out that the VMs cannot have the same name (which makes me think that, really, the "master" DNS name should be a service address with aliases for swapping out the VMs, if we were keeping the grid for the long term). So:
The shadow is a warm standby, so it can be replaced at any time (with no take-backs)... as long as it works in toolsbeta. It takes the grid 10 minutes or so to actually fail over, but you can fail over the master manually, destroy the master VM while the shadow handles requests, and then replace it and fail back.

On the exec queues:
The release value is still defined on host templates via complex_values slots=<%= 32 * @processorcount %>,release=<%= @lsbdistcodename %> in puppet, so it should work. The thing to "undo" is our deprecation of it in jsub, jstart, webservice, and maybe crontab.

bstorm@tools-sgegrid-master:~$ qconf -se tools-sgeexec-0939.tools.eqiad.wmflabs
hostname              tools-sgeexec-0939.tools.eqiad.wmflabs
load_scaling          NONE
complex_values        h_vmem=24G,slots=64,release=stretch
load_values           arch=lx-amd64,num_proc=4,mem_total=7978.445312M, \
                      swap_total=24474.992188M,virtual_total=32453.437500M, \
                      m_topology=SCSCSCSC,m_socket=4,m_core=4,m_thread=4, \
                      load_avg=4.400000,load_short=4.470000, \
                      load_medium=4.400000,load_long=4.400000, \
                      mem_free=7297.328125M,swap_free=24474.992188M, \
                      virtual_free=31772.320312M,mem_used=681.117188M, \
                      swap_used=0.000000M,virtual_used=681.117188M, \
                      cpu=98.100000,m_topology_inuse=SCSCSCSC, \
                      np_load_avg=1.100000,np_load_short=1.117500, \
                      np_load_medium=1.100000,np_load_long=1.100000
processors            4
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
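For reference, the complex_values line above is a comma-separated list of key=value resource definitions. A minimal sketch of pulling it apart (the parse_complex_values helper is hypothetical, not part of gridengine):

```python
def parse_complex_values(raw):
    """Parse a gridengine complex_values string like
    'h_vmem=24G,slots=64,release=stretch' into a dict."""
    result = {}
    for pair in raw.split(","):
        key, _, value = pair.strip().partition("=")
        result[key] = value
    return result

print(parse_complex_values("h_vmem=24G,slots=64,release=stretch"))
# → {'h_vmem': '24G', 'slots': '64', 'release': 'stretch'}
```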

The arg to qsub looks like what's in https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Specifying_an_operating_system_release

It used to be a required arg for webservice. Honestly, we can just specify a default instead (and change the default in time).
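A default could be handled in the wrappers with something like the following sketch. The function name and the defaulting behavior are illustrative only, not the actual jsub/webservice code; the `-l release=...` form matches the wikitech page linked above:

```python
def release_args(release=None, default="buster"):
    """Build the qsub resource-request arguments for an OS release,
    falling back to a default when the caller does not specify one."""
    return ["-l", "release=%s" % (release or default)]

print(release_args())          # → ['-l', 'release=buster']
print(release_args("stretch")) # → ['-l', 'release=stretch']
```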

Mentioned in SAL (#wikimedia-cloud) [2021-03-18T12:47:32Z] <arturo> delete puppet prefix toolsbeta-buster-grirdmaster (no longer useful) T277653

Mentioned in SAL (#wikimedia-cloud) [2021-03-18T12:48:13Z] <arturo> destroy VM toolsbeta-buster-gridmaster (no longer useful) T277653

Mentioned in SAL (#wikimedia-cloud) [2021-03-18T12:51:30Z] <arturo> rebuild toolsbeta-sgegrid-shadow instance as debian buster (T277653)

Change 673267 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: grid: base: stop using LVM

https://gerrit.wikimedia.org/r/673267

Change 673267 abandoned by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: grid: base: stop using LVM

Reason:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456

https://gerrit.wikimedia.org/r/673267

Mentioned in SAL (#wikimedia-cloud) [2021-03-18T18:44:33Z] <arturo> replacing toolsbeta-sgegrid-master with a Debian Buster VM (T277653)

Deleting the old grid master VM also deletes the DNS record toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs, which we apparently use somewhere.

toolsbeta.test@toolsbeta-bastion-05:~$ qstat
error: unable to send message to qmaster using port 6444 on host "toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs": can't resolve host name

Correct. The new server must have that same name.

It's in the project hiera:

commit d240fcf79853a9cd788759e375aade2557a82479 (HEAD -> master, origin/master, origin/HEAD)
Author: aborrero <aborrero@wikimedia.org>
Date:   Thu Mar 18 18:57:35 2021 +0000

    Horizon auto commit for user aborrero

diff --git a/toolsbeta/_.yaml b/toolsbeta/_.yaml
index fb8039f..7a4e05c 100644
--- a/toolsbeta/_.yaml
+++ b/toolsbeta/_.yaml
@@ -105,7 +105,7 @@ role::aptly::client::servername: tools-sge-services-03.tools.eqiad.wmflabs
 role::labs::nfsclient::lookupcache: all
 role::puppetmaster::standalone::autosign: true
 role::toollabs::k8s::master::use_puppet_certs: true
-sonofgridengine::gridmaster: toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs
+sonofgridengine::gridmaster: toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud
 ssh::server::disable_nist_kex: false
 ssh::server::enable_hba: true
 ssh::server::explicit_macs: false

Ooooohhhhhhhh, the wmflabs thing. I hope that works. It might...do something bad, but if it does, the fix is to create an alias or add the new name directly to the admin hosts.

The new master shows interesting things:

aborrero@toolsbeta-sgegrid-master:~ $ sudo qstat
/usr/share/gridengine/util/arch: 203: /usr/share/gridengine/util/arch: /usr/bin/cpp: not found
/var/lib/gridengine/util/arch: 203: /var/lib/gridengine/util/arch: /usr/bin/cpp: not found
error: commlib error: access denied (client IP resolved to host name "localhost". This is not identical to clients host name "toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud")
error: unable to send message to qmaster using port 6444 on host "toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud": got send error

Also, the bastion fails to interact with it:

toolsbeta.test@toolsbeta-bastion-05:~$ qstat
error: denied: host "toolsbeta-bastion-05.toolsbeta.eqiad1.wikimedia.cloud" is neither submit nor admin host

So this is not as simple as replacing the VMs.

Also, I'm a bit confused about this 127.0.1.1 entry in /etc/hosts:

127.0.1.1	toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud	toolsbeta-sgegrid-master
127.0.0.1	localhost

That looks like the standard format from Ubuntu, unless I'm not reading it right.
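The commlib "resolved to localhost" error above is consistent with the master's own FQDN mapping to a loopback address via that 127.0.1.1 line. A rough way to check, assuming a Debian-style /etc/hosts (the loopback_names helper is a hypothetical sketch):

```python
def loopback_names(hosts_text):
    """Return the set of names mapped to loopback (127.x.x.x)
    addresses in an /etc/hosts-style file."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if fields[0].startswith("127."):
            names.update(fields[1:])
    return names

hosts = """127.0.1.1\ttoolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud\ttoolsbeta-sgegrid-master
127.0.0.1\tlocalhost
"""
fqdn = "toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud"
print(fqdn in loopback_names(hosts))  # → True
```

With the FQDN on a loopback line, the name-based identity check gridengine does can come back as "localhost" rather than the FQDN, which would explain the access-denied error.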

So this is not as simple as replacing the VMs.

Yuck. It *should* have been that easy, but I'm now wondering if things changed in the buster image that are incompatible, AND whether the change of FQDN might be death for the whole cluster unless we make sure the new name is already listed among the admin hosts or the aliases.

This is the tools grid master's /etc/hosts:

bstorm@tools-sgegrid-master:~$ cat /etc/hosts
# HEADER: This file was autogenerated at 2019-12-04 20:47:27 +0000
# HEADER: by puppet.  While it can still be managed manually, it
# HEADER: is definitely not recommended.
127.0.0.1	localhost
::1	localhost	ip6-localhost ip6-loopback
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters

172.16.4.197	tools-sgegrid-master.tools.eqiad.wmflabs	tools-sgegrid-master
10.64.16.149	statsd.eqiad.wmnet	statsd

I was only able to make it all work right by changing the 127.0.0.1 line in /etc/hosts on your server to:

127.0.0.1	localhost   toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud

and restarting gridengine:

bstorm@toolsbeta-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
bstorm@toolsbeta-sgegrid-master:~$ qstat -f
/usr/share/gridengine/util/arch: 203: /usr/share/gridengine/util/arch: /usr/bin/cpp: not found
/var/lib/gridengine/util/arch: 203: /var/lib/gridengine/util/arch: /usr/bin/cpp: not found
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
task@toolsbeta-sgeexec-0901.to BI    0/0/50         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
continuous@toolsbeta-sgeexec-0 BC    0/0/50         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
webgrid-generic@toolsbeta-sgew B     0/0/256        -NA-     lx-amd64      au
---------------------------------------------------------------------------------
webgrid-lighttpd@toolsbeta-sge B     0/0/256        -NA-     lx-amd64      au

I put that back (so it's broken again), but hopefully that helps. The aliases seem to keep it from failing to recognize the host once the master itself knows its own name.

That clearly doesn't match the stretch behavior (and the cpp errors are still there).

Change 673448 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] sonofgridengine: master: ensure cpp package is installed

https://gerrit.wikimedia.org/r/673448

Change 673448 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] sonofgridengine: master: ensure cpp package is installed

https://gerrit.wikimedia.org/r/673448

Mentioned in SAL (#wikimedia-cloud) [2021-03-23T11:07:21Z] <arturo> drop and build again the VM toolsbeta-sgregrid-shadow (T277653)

Mentioned in SAL (#wikimedia-cloud) [2021-03-23T11:22:04Z] <arturo> drop and build again the VM toolsbeta-sgregrid-master (T277653)

Mentioned in SAL (#wikimedia-cloud) [2021-03-23T12:15:58Z] <arturo> delete & re-create VM tools-sgegrid-shadow as Debian Buster (T277653)

Mentioned in SAL (#wikimedia-cloud) [2021-03-25T16:05:26Z] <bstorm> failed over the tools grid to the shadow master T277653

Mentioned in SAL (#wikimedia-cloud) [2021-03-25T16:20:20Z] <arturo> rebuilding tools-sgegrid-master VM as debian buster (T277653)

Mentioned in SAL (#wikimedia-cloud) [2021-03-25T17:46:11Z] <arturo> rebooting tools-sgeexec-* nodes to account for new grid master (T277653)

Mentioned in SAL (#wikimedia-cloud) [2021-03-25T19:30:39Z] <bstorm> forced deletion of all jobs stuck in a deleting state T277653

Change 677873 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] sonofgridengine: grid-configurator: introduce support for the new domain

https://gerrit.wikimedia.org/r/677873

I took a look at the patches. One thing you are running into is that the source of truth for that script is NFS-hosted files generated by puppet runs. There are files under /data/project/.system_sge/gridengine/etc/ that are generated by puppet on each instance (for instance modules/profile/manifests/toolforge/grid/node/compute.pp). Whatever puppet believes the host FQDN to be is what gets left there, which is why deleting a node requires manually removing those files. grid-configurator expects that NFS dir to be accurate. That could be forced to the hostname with $project.eqiad1.wikimedia.cloud instead. The files only actually do anything when grid-configurator runs, so it would be "safe" to change their names and config.

The collectors from ::sonofgridengine build some of the queues as well (the web ones, I believe), so you might need to be a bit careful there to find all the places where the FQDN is used directly in puppet. Mistakes should be glossed over by the aliases file, though.
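To illustrate the renaming, here is a rough sketch of rewriting those per-host file names from the old domain suffix to the new one. The suffixes come from the discussion above, but the rename_host_records helper and the demo layout are hypothetical, not the actual grid-configurator code:

```python
import os
import tempfile

OLD = ".toolsbeta.eqiad.wmflabs"
NEW = ".toolsbeta.eqiad1.wikimedia.cloud"

def rename_host_records(directory):
    """Rename per-host files that still carry the old FQDN suffix,
    returning (old_name, new_name) pairs for each rename performed."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(OLD):
            target = name[: -len(OLD)] + NEW
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, target))
            renamed.append((name, target))
    return renamed

# Demo against a throwaway directory with one stale host record.
d = tempfile.mkdtemp()
open(os.path.join(d, "toolsbeta-sgeexec-0901" + OLD), "w").close()
result = rename_host_records(d)
print(result)
```

Since the files only take effect when grid-configurator runs, a dry-run variant that returns the pairs without calling os.rename would be a safer first pass.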

Mentioned in SAL (#wikimedia-cloud) [2021-04-08T18:25:16Z] <bstorm> cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for tools-sgegrid-master and tools-sgegrid-shadow using the old fqdns T277653

Mentioned in SAL (#wikimedia-cloud) [2021-04-08T18:27:41Z] <bstorm> cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for toolsbeta-sgegrid-master and toolsbeta-sgegrid-shadow using the old fqdns T277653

Change 678043 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] gridengine: set grid-configurator source files to use new domain name

https://gerrit.wikimedia.org/r/678043

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T15:36:47Z] <arturo> created VM toolsbeta-sgeexec-0903 (buster) (T277653)

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T16:41:51Z] <arturo> create VM toolsbeta-sgeexec-1002 (T277653)

Change 678043 merged by Bstorm:

[operations/puppet@production] gridengine: set grid-configurator source files to use new domain name

https://gerrit.wikimedia.org/r/678043

Change 680038 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] gridengine: set additional grid-configurator source files to new domain

https://gerrit.wikimedia.org/r/680038

Change 680038 merged by Bstorm:

[operations/puppet@production] gridengine: set additional grid-configurator source files to new domain

https://gerrit.wikimedia.org/r/680038

@aborrero just FYI, I noticed that diamond collectors don't run correctly on the new exec node in toolsbeta (0902). Every puppet run shows:
Notice: /Stage[main]/Diamond/Service[diamond]/ensure: ensure changed 'stopped' to 'running' (corrective)

I cleaned up the source dirs in toolsbeta, so when your patch is ready, you can test away (I didn't remove the workarounds for the legacy domain in there yet). I did *not* check carefully whether any of the host records are for systems that were deleted. If anything was created and then deleted, it will cause failures if not cleaned up in the source dirs. I only removed things with the eqiad.wmflabs domain name.

If there's anything missing like that, the places to delete it are in /data/project/.system_sge/gridengine/etc/ AND /data/project/.system_sge/gridengine/collectors/. That last part has sent us on wild goose chases before :)
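A cleanup pass over both of those locations could be sketched as follows. The two top-level directories match the comment above, but the subdirectory names in the demo and the find_stale_records helper are illustrative assumptions:

```python
import os
import tempfile

STALE_SUFFIX = ".eqiad.wmflabs"

def find_stale_records(*roots):
    """Walk the given directories and report any per-host files
    still named with the legacy domain suffix."""
    stale = []
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if name.endswith(STALE_SUFFIX):
                    stale.append(os.path.join(dirpath, name))
    return sorted(stale)

# Demo with a throwaway layout mirroring the two places named above.
base = tempfile.mkdtemp()
for sub in ("etc/submithosts", "collectors/hostlists"):
    os.makedirs(os.path.join(base, sub))
open(os.path.join(base, "etc", "submithosts",
                  "tools-sgegrid-master.tools" + STALE_SUFFIX), "w").close()
stale = find_stale_records(os.path.join(base, "etc"),
                           os.path.join(base, "collectors"))
print(stale)  # one stale submithosts entry found
```

Reporting first and deleting only after review fits the "wild goose chase" history: a stale entry in either tree can break grid-configurator runs long after the host itself is gone.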

This is the traceback:

Apr 16 10:16:21 toolsbeta-sgeexec-0902 systemd[1]: Started diamond - A system statistics collector for graphite.
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: Changed UID: 0 () GID: 0 ().
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: Process SyncManager-1:
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: Traceback (most recent call last):
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self.run()
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self._target(*self._args, **self._kwargs)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/managers.py", line 550, in _run_server
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     server = cls._Server(registry, address, authkey, serializer)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/managers.py", line 162, in __init__
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self.listener = Listener(address=address, backlog=16)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/connection.py", line 132, in __init__
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self._listener = SocketListener(address, family, backlog)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/connection.py", line 256, in __init__
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self._socket.bind(address)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/socket.py", line 228, in meth
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     return getattr(self._sock,name)(*args)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: error: [Errno 1] Operation not permitted
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: Unhandled exception:
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: traceback: Traceback (most recent call last):
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/bin/diamond", line 281, in main
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     server = Server(configfile=options.configfile)
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/dist-packages/diamond/server.py", line 59, in __init__
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self.manager = multiprocessing.Manager()
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/__init__.py", line 99, in Manager
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     m.start()
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:   File "/usr/lib/python2.7/multiprocessing/managers.py", line 528, in start
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]:     self._address = reader.recv()
Apr 16 10:16:22 toolsbeta-sgeexec-0902 diamond[32249]: EOFError

I think this only happens on this particular Debian Stretch node. I can't find the issue on other Buster or Stretch servers, so I would just ignore it.

Change 677873 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] sonofgridengine: grid-configurator: introduce support for the new domain

https://gerrit.wikimedia.org/r/677873

Ok, now that that's merged, I think we can move on to:

Mentioned in SAL (#wikimedia-cloud) [2021-04-16T23:15:32Z] <bstorm> cleaned up all source files for the grid with the old domain name to enable future node creation T277653