
Basic lighttpd+php webservice fails to run on Stretch grid
Closed, Resolved (Public)

Description

I created a new tool named sge-status and deployed the code from https://phabricator.wikimedia.org/source/tool-gridengine-status/ there.

When I run webservice start from tools-sgebastion-06.tools.eqiad.wmflabs, the job is submitted to the new grid but fails to run because of flags added to the job by the webservice script:

$ qstat -j 2
==============================================================
job_number:                 2
exec_file:                  job_scripts/2
submission_time:            Thu Dec 20 03:10:21 2018
owner:                      tools.sge-status
uid:                        53920
group:                      tools.sge-status
gid:                        53920
sge_o_home:                 /data/project/sge-status
sge_o_log_name:             tools.sge-status
sge_o_path:                 /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /mnt/nfs/labstore-secondary-tools-project/sge-status
sge_o_host:                 tools-sgebastion-06
account:                    sge
stderr_path_list:           NONE:NONE:/data/project/sge-status/error.log
hard resource_list:         h_vmem=4G,release=trusty
mail_list:                  tools.sge-status@tools-sgebastion-06.tools.eqiad.wmflabs
notify:                     FALSE
job_name:                   lighttpd-sge-status
stdout_path_list:           NONE:NONE:/data/project/sge-status/error.log
stdin_path_list:            NONE:NONE:/dev/null
jobshare:                   0
hard_queue_list:            webgrid-lighttpd
env_list:                   TERM=NONE
script_file:                /usr/bin/webservice-runner --register-proxy --type lighttpd
binding:                    NONE
job_type:                   binary
scheduling info:            cannot run in queue "task" because it is not contained in its hard queue list (-q)
                            cannot run in queue "continuous" because it is not contained in its hard queue list (-q)
                            cannot run in queue "webgrid-generic" because it is not contained in its hard queue list (-q)
                            (-l h_vmem=4G,release=trusty) cannot run at host "tools-sgewebgrid-lighttpd-0901.tools.eqiad.wmflabs" because it offers only hf:release=stretch
                            (-l h_vmem=4G,release=trusty) cannot run at host "tools-sgewebgrid-lighttpd-0902.tools.eqiad.wmflabs" because it offers only hf:release=stretch
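
The scheduling info above comes down to a mismatch between the release the job requests and the release the hosts offer. A minimal Python sketch of that comparison follows, assuming the resource list is a comma-separated key=value string as it appears in the qstat output. The helper names parse_resource_list and release_mismatch are hypothetical and are not part of webservice or gridengine.

```python
from typing import Dict


def parse_resource_list(value: str) -> Dict[str, str]:
    """Split a string such as 'h_vmem=4G,release=trusty' into a dict."""
    resources = {}
    for item in value.split(","):
        key, _, val = item.strip().partition("=")
        if key:
            resources[key] = val
    return resources


def release_mismatch(requested: str, host_release: str) -> bool:
    """Return True when the job pins a release that the host does not offer."""
    wanted = parse_resource_list(requested).get("release")
    return wanted is not None and wanted != host_release


# Values taken from the qstat output above.
print(release_mismatch("h_vmem=4G,release=trusty", "stretch"))
```

A job with no release entry in its resource list is never reported as mismatched here, which is the situation the job would be in once the label is no longer added.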

This didn't surprise me a lot, but what happened next did. The webservicemonitor watchdog script for the legacy grid saw the new ~tools.sge-status/service.manifest file, noticed that the associated job was not running on the legacy grid, and started it there. The end result is that the lighttpd process is running on the legacy grid rather than the new grid.

So 3 bugs in one:

  • new grid is missing the expected 'webgrid-lighttpd' queue
  • webservice is still adding '-l release=trusty' to submitted jobs
  • webservicemonitor doesn't know about separate grids and will launch any gridengine service.manifest job on the grid that the monitor is attached to
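
The third bug could be guarded against with a check along these lines. This is only a sketch of the idea, not the actual tools-manifest code: the function name, the manifest-as-dict representation, the 'distribution' key name, and the fallback for manifests that carry no such key are all assumptions.

```python
def should_start_here(manifest: dict, monitor_distribution: str,
                      default_distribution: str = "trusty") -> bool:
    """Return True if the monitor's own grid is the one the manifest targets.

    manifest: parsed contents of a tool's service.manifest (hypothetical shape).
    monitor_distribution: the release of the grid this monitor is attached to.
    """
    # ASSUMPTION: a manifest with no 'distribution' key was written by an
    # older webservice and therefore belongs to the legacy (trusty) grid.
    wanted = manifest.get("distribution", default_distribution)
    return wanted == monitor_distribution


# A manifest written from a Stretch bastion is ignored by the legacy monitor.
print(should_start_here({"distribution": "stretch"}, "trusty"))
```

With a check like this, each monitor skips manifests that name a different grid instead of restarting the job wherever the monitor happens to run.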

Event Timeline

bd808 triaged this task as Medium priority.
bd808 created this task.

Change 480900 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] Remove 'release' qsub label

https://gerrit.wikimedia.org/r/480900

Change 480901 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] Track platform of submit host in service.manifest

https://gerrit.wikimedia.org/r/480901

Change 480902 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-manifest@master] Respect 'distribution' from service.manifest

https://gerrit.wikimedia.org/r/480902

Change 480902 merged by Bstorm:
[operations/software/tools-manifest@master] Respect 'distribution' from service.manifest

https://gerrit.wikimedia.org/r/480902

Change 480900 merged by Bstorm:
[operations/software/tools-webservice@master] Remove 'release' qsub label

https://gerrit.wikimedia.org/r/480900

Change 480901 merged by Bstorm:
[operations/software/tools-webservice@master] Track platform of submit host in service.manifest

https://gerrit.wikimedia.org/r/480901

Mentioned in SAL (#wikimedia-cloud) [2018-12-21T00:01:30Z] <bd808> Installed toollabs-webservice 0.43 on tools-bastion-02 for T212390

Mentioned in SAL (#wikimedia-cloud) [2018-12-21T00:19:11Z] <bd808> Installed toollabs-webservice 0.43 on all hosts for T212390

Mentioned in SAL (#wikimedia-cloud) [2018-12-21T00:22:32Z] <bd808> Rebuilding all docker containers with toollabs-webservice 0.43 for T212390

Mentioned in SAL (#wikimedia-cloud) [2018-12-21T00:35:56Z] <bd808> Installed tools-manifest 0.14 for T212390

The original issues are resolved, but now that we are a layer deeper we have another issue to address. It seems that the webservice-runner process cannot communicate with the dynamicproxy instance. This is probably a cross-region issue, as the proxies are still running in the 'eqiad' region and the new grid engine hosts are in the 'eqiad-1' region.

Added 172.16.0.0/21 on port 5669 to the proxy security group for project-proxy. I figure that basically anywhere in the proxy mess that is specifically opened for the 10.x network is going to need the new region eventually, even if that likely isn't the problem here.

Also note that once the queues have errored, they won't re-run the job. They acquire an E status that won't clear even if you reboot the instance, so we can get faked out by that blocking a run after we fix things. Running qmod -c '*' will clear all errors. That caused me grief last night.

Inserted some debugging print statements on the exec node and found that the job fails after connecting and sending the port info.

sending 172.16.4.211 at 33909
res is

The response is blank. Looking for errors proxy-side. I suspect it cannot connect *back* on that port.

Found the python script that is supposed to record that information in redis then reply with 'ok'. Looking for some kind of log or something from it.
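
To make the failure mode concrete, here is a rough Python sketch of a registration client that treats anything other than an 'ok' reply as a failure, including the blank response seen above. The wire format (a single "host:port" line) and the function name register are assumptions made for illustration only; the real webservice-runner and proxy script may speak a different protocol.

```python
import socket


def register(sock: socket.socket, host: str, port: int) -> str:
    """Send the service address and insist on an 'ok' reply.

    ASSUMPTION: the request is one 'host:port' line and the reply is the
    literal text 'ok'. Only the 'ok' reply is taken from the thread.
    """
    sock.sendall(f"{host}:{port}\n".encode())
    reply = sock.recv(1024).decode().strip()
    if reply != "ok":
        # A blank reply ends up here, matching the empty 'res is' line.
        raise RuntimeError(f"proxy registration failed, reply was {reply!r}")
    return reply


# Local demonstration with a socket pair standing in for the proxy.
client, fake_proxy = socket.socketpair()
fake_proxy.sendall(b"ok\n")
print(register(client, "172.16.4.211", 33909))
client.close()
fake_proxy.close()
```

Failing loudly on an empty reply, rather than printing it and carrying on, would have pointed at the proxy side sooner.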

identd seems to work from the proxy to the exec node...

We had the wrong proxy node set for the new grid in hiera. Fixed that. Looks MUCH better.

From eqiad1-r:

curl http://172.16.4.211:40649
{"errors":[{"id":"g5lshep1-f8434934","status":500,"title":"Internal Server Error"}]}

From eqiad:

curl http://172.16.4.211:40649
curl: (7) Failed to connect to 172.16.4.211 port 40649: Connection timed out

We will need permissive rules for webgrid nodes to make this stuff work.
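
The two curl runs above amount to a TCP reachability test from each region. A small Python sketch of the same probe is shown here, in case it needs to be repeated across many hosts or ports. The function name can_connect and the 3 second timeout are arbitrary choices for this example.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Running can_connect("172.16.4.211", 40649) from a host in each region would distinguish the two cases above: an HTTP 500 still means the TCP connection succeeded, while a timeout means it was dropped along the way.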

Bstorm claimed this task.

Added this rule to the webserver group, which is used for all webgrid nodes (and I think k8s as well): Ingress IPv4 TCP 1024 - 65535 10.68.16.0/21. It will be needed while servers are being moved to the new region, at least until we move the proxy server, at which point we will have to reverse it.
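
For checking whether a given source address and port would be covered by a rule like this one, a short Python sketch using the standard ipaddress module is given below. The IngressRule class is hypothetical and only models the CIDR and port range; it does not talk to OpenStack. The example source addresses are made up for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical model of a TCP ingress rule: source CIDR plus port range."""
    cidr: str
    port_min: int
    port_max: int

    def allows(self, source_ip: str, dest_port: int) -> bool:
        in_network = ipaddress.ip_address(source_ip) in ipaddress.ip_network(self.cidr)
        return in_network and self.port_min <= dest_port <= self.port_max


# The rule quoted in the comment above.
webgrid_rule = IngressRule("10.68.16.0/21", 1024, 65535)

# 10.68.17.5 is an invented address inside the /21, not a real proxy host.
print(webgrid_rule.allows("10.68.17.5", 40649))
```

A /21 starting at 10.68.16.0 spans 10.68.16.0 through 10.68.23.255, so a source outside that range, or a destination port below 1024, is not covered by this rule.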

This was wonderfully educational about our proxy setup :)

The services should now work fine.