Get shadow master to actually work in toolsbeta
Closed, Resolved · Public

Description

This task is to move the work on this particular piece of the grid out of T200557 and into a tidier subtask.

The grid master works now (and starts on reboot), and the shadow master does the same. However, at this point, not much seems to convince the shadow master to actually take over. It should be monitoring a heartbeat file for updates; if we tweak some of the settings around that, we might get better results.
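
For context, a healthy qmaster bumps a counter in that heartbeat file roughly every 30 seconds, and sge_shadowd watches the counter to decide whether the master is still alive. A quick sanity check might look something like this (the spool path is the one that shows up in the debug session later in this task):

# watch the heartbeat counter tick while the qmaster is healthy
watch -n 30 cat /data/project/.system_sge/gridengine/spool/qmaster/heartbeat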

Event Timeline

Bstorm triaged this task as Medium priority. Dec 5 2018, 11:03 PM
Bstorm created this task.

Change 477927 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: try setting config variables to tune the shadow master

https://gerrit.wikimedia.org/r/477927

Change 477927 merged by Bstorm:
[operations/puppet@production] sonofgridengine: try setting config variables to tune the shadow master

https://gerrit.wikimedia.org/r/477927

Looking to set:

SGE_CHECK_INTERVAL=45
SGE_GET_ACTIVE_INTERVAL=90
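
SGE_CHECK_INTERVAL is how often (in seconds) sge_shadowd re-reads the heartbeat file, and SGE_GET_ACTIVE_INTERVAL is how long the heartbeat can go unchanged before shadowd attempts a takeover. A minimal sketch of what setting them could look like, assuming they end up in an environment file the shadowd service reads (the path below is illustrative, not what the puppet patch installs):

cat <<'EOF' > /etc/default/sge-shadowd    # illustrative path, not the puppetized one
SGE_CHECK_INTERVAL=45        # seconds between heartbeat checks (shadowd default: 60)
SGE_GET_ACTIVE_INTERVAL=90   # seconds of unchanged heartbeat before takeover (default: 240)
EOF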

Change 477929 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: create the dir for the override conf

https://gerrit.wikimedia.org/r/477929

Change 477929 merged by Bstorm:
[operations/puppet@production] sonofgridengine: create the dir for the override conf

https://gerrit.wikimedia.org/r/477929

Change 477935 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: make the shadow master override read only

https://gerrit.wikimedia.org/r/477935

Change 477935 merged by Bstorm:
[operations/puppet@production] sonofgridengine: make the shadow master override read only

https://gerrit.wikimedia.org/r/477935

Change 477940 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: grant more control over shadowd using the env settings

https://gerrit.wikimedia.org/r/477940

Change 477940 merged by Bstorm:
[operations/puppet@production] sonofgridengine: grant more control over shadowd using the env settings

https://gerrit.wikimedia.org/r/477940

Change 478041 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: simplifying the config and making it more "normal" for grid

https://gerrit.wikimedia.org/r/478041

Change 478041 merged by Bstorm:
[operations/puppet@production] gridengine: simplifying the config and making it more "normal" for grid

https://gerrit.wikimedia.org/r/478041

Change 478063 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: point to the actual executable for gridengine

https://gerrit.wikimedia.org/r/478063

Change 478063 merged by Bstorm:
[operations/puppet@production] sonofgridengine: point to the actual executable for gridengine

https://gerrit.wikimedia.org/r/478063

Change 480160 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: fixing the master service so it operates best for autostart

https://gerrit.wikimedia.org/r/480160

Change 480160 merged by Bstorm:
[operations/puppet@production] sonofgridengine: fixing the master service so it operates best for autostart

https://gerrit.wikimedia.org/r/480160

Change 480164 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: make shadow more consistent with master

https://gerrit.wikimedia.org/r/480164

Change 480164 merged by Bstorm:
[operations/puppet@production] sonofgridengine: make shadow more consistent with master

https://gerrit.wikimedia.org/r/480164

I've run a debugging session and I think I've learned a bit more about this situation. What I did was this:

  • run qmaster on the sge master node (the normal setup)
  • run sge_shadowd in the foreground with debug info on the shadow node
  • manually interfere with qmaster and/or the heartbeat file
  • see sge_shadowd responding, trying to spawn a new sge_qmaster
  • see systemd complaining, because it tries to restart the sge_qmaster service, which fails because there has already been an sge_qmaster process started by the shadow
root@toolsbeta-sgegrid-shadow:~# . /data/project/.system_sge/gridengine/util/dl.sh
root@toolsbeta-sgegrid-shadow:~# dl 2
root@toolsbeta-sgegrid-shadow:~# SGE_CHECK_INTERVAL=2 SGE_GET_ACTIVE_INTERVAL=5 SGE_ROOT=/data/project/.system_sge/gridengine SGE_CELL=default /usr/sbin/sge_shadowd
     0   9332 140094966475648 --> sge_shadowd() {
[...]
   112   9332         main --> get_qmaster_heartbeat() {
   113   9332         main <-- get_qmaster_heartbeat() ../daemons/common/qmaster_heartbeat.c 104 }
   114   9332         main --> get_qmaster_heartbeat() {
   115   9332         main <-- get_qmaster_heartbeat() ../daemons/common/qmaster_heartbeat.c 104 }

[...]

If you manually delete the /data/project/.system_sge/gridengine/spool/qmaster/heartbeat file, you will start seeing messages like this:

[...]
   184  10385         main <-- get_qmaster_heartbeat() ../daemons/common/qmaster_heartbeat.c 78 }
   185  10385         main     can't read heartbeat file. last_heartbeat=5, heartbeat=4294967295
   186  10385         main --> get_qmaster_heartbeat() {
   187  10385         main <-- get_qmaster_heartbeat() ../daemons/common/qmaster_heartbeat.c 78 }
   188  10385         main     can't read heartbeat file. last_heartbeat=5, heartbeat=4294967295
   189  10385         main --> get_qmaster_heartbeat() {
[...]

Eventually, once the timeouts expire, the shadow will actually try to start the sge_qmaster process again:

[...]
   120  10411         main     heartbeat not changed since seconds: 6
   121  10411         main --> check_if_valid_shadow() {
   122  10411         main --> isLocked() {
   123  10411         main <-- isLocked() ../daemons/common/lock.c 99 }
   124  10411         main --> get_qm_name() {
   125  10411         main <-- get_qm_name() ../libs/gdi/qm_name.c 135 }
   126  10411         main --> sge_gethostbyname_retry() {
   127  10411         main <-- sge_gethostbyname_retry() ../libs/uti/sge_hostname.c 359 }
   128  10411         main --> host_in_file() {
   129  10411         main <-- host_in_file() ../daemons/shadowd/shadowd.c 549 }
   130  10411         main     "/usr/sbin"
   131  10411         main     we are a candidate for shadow master
   132  10411         main <-- check_if_valid_shadow() ../daemons/shadowd/shadowd.c 516 }
   133  10411         main --> qmaster_lock() {
   134  10411         main <-- qmaster_lock() ../daemons/common/lock.c 60 }
   135  10411         main --> get_qmaster_heartbeat() {
   136  10411         main <-- get_qmaster_heartbeat() ../daemons/common/qmaster_heartbeat.c 104 }
   137  10411         main     old qmaster name in act_qmaster and old heartbeat
   138  10411         main --> compare_qmaster_names() {
   139  10411         main --> get_qm_name() {
   140  10411         main <-- get_qm_name() ../libs/gdi/qm_name.c 135 }
   141  10411         main     strcmp() of old and new qmaster returns: 0
   142  10411         main <-- compare_qmaster_names() ../daemons/shadowd/shadowd.c 457 }
   143  10411         main --> shadowd_is_old_master_enrolled() {
   144  10411         main     Try to send status information message to previous master host "toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs" to port 6444
   145  10411         main     old qmaster is still running
   146  10411         main     endpoint is up since 261 seconds and has status 0
   147  10411         main <-- shadowd_is_old_master_enrolled() ../daemons/shadowd/shadowd.c 147 }
   148  10411         main --> qmaster_unlock() {
   149  10411         main <-- qmaster_unlock() ../daemons/common/lock.c 79 }
   150  10411         main --> get_qmaster_heartbeat() {
   151  10411         main <-- get_qmaster_heartbeat() ../daemons/common/qmaster_heartbeat.c 104 }
   152  10411         main --> get_qmaster_heartbeat() {
[...]

But this is not detected if I kill the sge_qmaster daemon by hand (or SIGSTOP or whatever).

I haven't reached any conclusion yet. This will need further investigation, because I found some cases in which the shadow didn't detect the non-running sge_qmaster process (as @Bstorm found already). Given that, I wonder what kind of HA this is actually providing and whether it's worth investing any time in it.

It's not the greatest system either way. The original setup, using sysV init scripts, doesn't pay any attention to how many instances start, which is why I've been fiddling with the forking behavior of the scripts to see if it prefers one behavior over another. However, this is a service that was written WELL before systemd. Also, the cluster breaks the moment a qmaster starts with the wrong vars.

I would imagine that the shadow should be allowed to start the process when it is a forking setup, but systemd is picky about that (which is why I've generally set the units not to run in the foreground). If that's all it is, there must be a way to tell systemd to let it "fork as much as it likes". That will probably break init-script-style functions like "stop", but if it only does that on the shadow, I'm ok with that. It seems unrelated at first glance, but the KillMode of the unit can affect such things (https://unix.stackexchange.com/questions/204922/systemd-and-process-spawning). I'll poke at that if there is time today.
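
As a rough sketch of what that KillMode tweak could look like (the drop-in path and unit name here are assumptions; the real change lands via puppet): with KillMode=process, systemd only signals the shadowd main process on stop or restart, so an sge_qmaster that shadowd spawned during a takeover is left running instead of being reaped.

cat <<'EOF' > /etc/systemd/system/sge_shadowd.service.d/killmode.conf
[Service]
# Only signal the main shadowd process; leave children (e.g. a spawned sge_qmaster) alone.
KillMode=process
EOF
systemctl daemon-reload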

Change 480530 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: change the killmode of sge_shadowd to process

https://gerrit.wikimedia.org/r/480530

We could try forgetting about sge_shadowd and building an HA pair using corosync/pacemaker, with a simple check to see if the master is responding.

I'm not sure I trust this sge_shadowd mechanism.

sge_shadowd has proven unreliable in toolforge. However, gridengine is kind of weird at times. I think introducing corosync/pacemaker would make it behave more strangely than it already does. Simply restarting qmaster procs can end up confusing the setup (at least in Oracle Grid Engine). I don't want to spend too much time on something like that when it doesn't seem that hard to modify jsub to point at k8s, for instance. I regard gridengine as a deprecated tool that is well suited to HPC environments and older models of orchestration before anyone talked about orchestration.

Right now, toolforge has no HA, much less toolsbeta, from what I can tell. It's possible it never has (looking at old incident reports), partly because of a misconfig introduced during puppetization. If adjusting the killmode fixes it (which it can in theory), I'm good. If it doesn't, there's a manual way to fail over that might work: add the master role to the shadow host, then update the act_qmaster file. But it might not work right because of other config on the server that puppet won't clean up. There's also the whole issue of blessing it as an admin host, etc. (manually adding shadows as admin hosts is what I do now so they will theoretically pick up correctly).
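
A hedged sketch of that manual failover, assuming the admin-host blessing plus an act_qmaster update is all it takes (untested; the master unit name is a guess):

# from a host that is already an admin host: bless the shadow as an admin host
qconf -ah toolsbeta-sgegrid-shadow.toolsbeta.eqiad.wmflabs
# point the cell at the new master (path built from the SGE_ROOT/SGE_CELL used above)
echo toolsbeta-sgegrid-shadow.toolsbeta.eqiad.wmflabs > \
    /data/project/.system_sge/gridengine/default/common/act_qmaster
# then start a qmaster on the shadow host (unit name is an assumption)
systemctl start gridengine-master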

Change 480530 merged by Bstorm:
[operations/puppet@production] sonofgridengine: change the killmode of sge_shadowd to process

https://gerrit.wikimedia.org/r/480530

Change 480567 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: change back to simple foreground service for the master

https://gerrit.wikimedia.org/r/480567

Change 480567 merged by Bstorm:
[operations/puppet@production] sonofgridengine: change back to simple foreground service for the master

https://gerrit.wikimedia.org/r/480567

Change 480569 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: correct default file configuration

https://gerrit.wikimedia.org/r/480569

Change 480569 merged by Bstorm:
[operations/puppet@production] sonofgridengine: correct default file configuration

https://gerrit.wikimedia.org/r/480569

Found an issue: the wrong SGE_ROOT is being set for shadowd. That could be the thing. Either way, I'm finding that the more I test, the more the foreground versions are just more reliable in systemd (except in some cases, but I think I can get around those now).
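
A generic way to confirm what environment a running shadowd actually got (just a debugging sketch, nothing specific to the patch):

# dump the SGE_* variables from the environment of the running sge_shadowd
cat /proc/$(pgrep -o sge_shadowd)/environ | tr '\0' '\n' | grep '^SGE_'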

Ah, the reason it doesn't take over when the old one is dead is a known issue.

208  26769         main     heartbeat not changed since seconds: 240
209  26769         main --> check_if_valid_shadow() {
210  26769         main --> isLocked() {
211  26769         main <-- isLocked() ../daemons/common/lock.c 99 }
212  26769         main     lock file exits

https://arc.liv.ac.uk/SGE/howto/commonproblems.html#shadow

Deleted the lock file and am observing this in debug mode.
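
For anyone repeating this: the stale lock presumably sits next to the heartbeat file in the qmaster spool directory (path assumed from the earlier debug session):

# remove the stale qmaster lock that blocks shadowd's takeover
rm /data/project/.system_sge/gridengine/spool/qmaster/lock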

This also explains why it would not take over after a manual kill of the master process. Hilarious.

Boom. That's it.

 358  26769 140171247295360 --> sge_set_message_id_output() {
   359  26769 140171247295360 <-- sge_set_message_id_output() ../libs/uti/sge_language.c 479 }
   360  26769         main --> do_wait() {
local configuration toolsbeta-sgegrid-shadow.toolsbeta.eqiad.wmflabs not defined - using global configuration
read job database with 0 entries in 0 seconds
qmaster hard descriptor limit is set to 1048576
qmaster soft descriptor limit is set to 1024
qmaster will use max. 1004 file descriptors for communication
qmaster will accept max. 950 dynamic event clients
starting up SGE 8.1.9 (lx-amd64)
Q:0, AQ:4 J:0(0), H:4(4), C:53, A:2, D:1, P:0, CKPT:1, US:1, PR:0, RQS:0, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:4, AQ:4 J:0(0), H:4(4), C:53, A:2, D:1, P:0, CKPT:1, US:1, PR:0, RQS:0, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:4, AQ:4 J:0(0), H:4(4), C:53, A:2, D:1, P:0, CKPT:1, US:1, PR:0, RQS:0, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:4, AQ:4 J:0(0), H:4(4), C:53, A:2, D:1, P:0, CKPT:1, US:1, PR:0, RQS:0, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------

From bastion:

bstorm@toolsbeta-sgebastion-04:~$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
task@toolsbeta-sgeexec-0901.to BI    0/0/50         0.03     lx-amd64
---------------------------------------------------------------------------------
continuous@toolsbeta-sgeexec-0 BC    0/0/50         0.03     lx-amd64
---------------------------------------------------------------------------------
webgrid-generic@toolsbeta-sgew B     0/0/256        0.03     lx-amd64
---------------------------------------------------------------------------------

Also, it fixes itself after you bring the old master back up.

Documenting the lock-file deletion, making both services foregrounded (per @aborrero's recommendation), and closing when done.

Change 480576 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: foreground the shadow master process

https://gerrit.wikimedia.org/r/480576

Change 480576 merged by Bstorm:
[operations/puppet@production] sonofgridengine: foreground the shadow master process

https://gerrit.wikimedia.org/r/480576