Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes
Closed, Resolved · Public

Description

Hi,

I have one issue with migrating kmlexport from grid cron to Toolforge jobs: I cannot run webservice restart from a script in a scheduled job. From time to time lighttpd goes down (overload, bad request, ...), so on the grid I run a scheduled cron job once an hour that checks whether lighttpd is up and restarts it if not.

This does not seem possible with Toolforge jobs, as the job environment apparently cannot see the webservice or lighttpd commands.

Is there a way to do this using Toolforge jobs? Or is there another way, for example in the webservice/lighttpd config? I would like to keep this one-hour cap on kmlexport downtime, as the tool is still used a lot across the wikis and Commons.
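
For context, a minimal sketch of the kind of hourly watchdog described above (a hypothetical reconstruction, not the actual kmlexport script; the URL check and timeout are assumptions):

#!/bin/bash
# Hypothetical hourly grid cron watchdog (illustrative sketch).
# Checks whether lighttpd answers over HTTP and restarts the
# webservice if it does not.
status=$(curl -o /dev/null --silent -w '%{http_code}' --max-time 30 https://kmlexport.toolforge.org/)
if [ "$status" -ne 200 ]; then
    webservice restart
fi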


Kindly migrate your tool (https://grid-deprecation.toolforge.org/t/kmlexport) from Toolforge GridEngine to Toolforge Kubernetes.

Toolforge GridEngine is being deprecated.
See: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/

Please note that a volunteer may perform this migration if this has not been done after some time.
If you have already migrated this tool, kindly mark this as resolved.

If you would rather shut down this tool, kindly do so and mark this as resolved.

Useful Resources:
Migrating Jobs from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Grid_Engine_migration
Migrating Web Services from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Move_a_grid_engine_webservice
Python
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Rebuild_virtualenv_for_python_users

Event Timeline

Currently there's no easy way to run Toolforge commands from within Kubernetes. It would be relatively easy to add health probes to the webservices, though, and have them restart by themselves if they don't reply. I'll open a task for that :)

@Dvorapa This is unblocked now: you can specify an HTTP probe from the webservice CLI (toolforge webservice start --health-check-url /healthz, for example). Can you try that out and see if it works for you?

@dcaro Seems like it is not deployed yet?

toolforge webservice: error: unrecognized arguments: --health-check-url /healthz

Oh, I see. With the flag, https://kmlexport.toolforge.org/ gives 503. Without the flag, it gives 200, so the flag breaks the root :/

(Note, in case it helps: this is the command kmlexport's webservice is started with: $ webservice --backend kubernetes -m 1536Mi --health-check-path /healthz perl5.36 start)

$ webservice stop
Stopping webservice
$ webservice --backend kubernetes -m 1536Mi --health-check-path /healthz perl5.36 start
Starting webservice...
$ curl -w %{http_code} -o /dev/null --silent https://kmlexport.toolforge.org/
503
$ webservice stop
Stopping webservice
$ webservice --backend kubernetes -m 1536Mi perl5.36 start
Starting webservice.....
$ curl -w %{http_code} -o /dev/null --silent https://kmlexport.toolforge.org/
200
(pod event from kubectl, from the run with the probe enabled:)
Warning  Unhealthy  1s (x18 over 18s)  kubelet            Startup probe failed: HTTP probe failed with statuscode: 404
$ curl -w %{http_code} -o /dev/null --silent https://kmlexport.toolforge.org/healthz
404

Seems like you're missing a handler for the /healthz endpoint.

No, I just deactivated the probe, as it broke the root of the kmlexport webapp (I had to revert to webservice --backend kubernetes -m 1536Mi perl5.36 start so that users can use it).

It feels weird that the probe breaks the root, as the probe itself does nothing to change the code or the way the code runs; it just makes an HTTP request.

Can you enable the health endpoint but not the health check? That way we can get the service up, manually call the health endpoint, and see the behavior.

I'm not sure what exactly you want me to do. If I start the webservice with the probe, the /healthz endpoint is somehow created automatically (why?): there is a blank page returning code 200. It's just the root that starts giving 503, which is super weird.

Here is a better overview of what works and how:

$ webservice stop
Stopping webservice
$ webservice --backend kubernetes -m 1536Mi --health-check-path /healthz perl5.36 start
Starting webservice...
$ curl -w %{http_code} -o /dev/null --silent https://kmlexport.toolforge.org/
503
$ curl -w %{http_code} -o /dev/null --silent https://kmlexport.toolforge.org/healthz/
200
$ webservice stop
Stopping webservice
$ webservice --backend kubernetes -m 1536Mi perl5.36 start
Starting webservice.....
$ curl -w %{http_code} -o /dev/null --silent https://kmlexport.toolforge.org/
200
$ curl -w %{http_code} -o /dev/null --silent https://kmlexport.toolforge.org/healthz/
404

What I would expect is for both to be 200 at the same time; am I correct?

PS: After several seconds of uptime, even the /healthz endpoint starts to give 503.

Is the code available anywhere for us to inspect?

How did you implement the /healthz endpoint in your code?

If you run it locally, does the /healthz endpoint work?

Also, shouldn't the /healthz endpoint be generated automatically by lighttpd, the same as mod_status? Or would it be a good strategy to use mod_status's endpoint for the probe as well? Also, can I just use the root for the probe, since the whole webapp lives at the root?

I would recommend creating your own endpoint and doing a couple of quick checks if you need them (check that DB access works if you use the DB, or that NFS files are accessible if you use NFS).

But yes, it's not autogenerated (you can pass whichever endpoint you want). Using mod_status will only tell you that lighttpd is up, but your app might not be working as expected; still, it might be good enough depending on your app.
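
For illustration, such an endpoint could be a small CGI script (a sketch under assumptions: the checked path and the CGI setup are hypothetical, and kmlexport's actual handler is not shown here):

#!/bin/bash
# Hypothetical /healthz CGI handler doing one quick check: that the
# tool's files on NFS are readable.
if [ -r "$HOME/public_html" ]; then
    printf 'Content-Type: text/plain\r\n\r\n'
    printf 'OK\n'
else
    # Any non-200 status makes the probe fail, which triggers a restart.
    printf 'Status: 503 Service Unavailable\r\n'
    printf 'Content-Type: text/plain\r\n\r\n'
    printf 'NFS check failed\n'
fi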

You can also just create a healthz file in public_html to get a silly endpoint (that would check that NFS is working, btw, so maybe not so silly).
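
That file-based variant is a one-liner (assuming lighttpd's default document root of ~/public_html):

# Static healthz endpoint: lighttpd serves the file with a 200, and
# reading it exercises NFS.
echo OK > ~/public_html/healthz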

Sorry for the confusion, of course I had a typo in my code. That's why the /healthz endpoint was only working for several seconds after a restart.

Now the root is working and the /healthz endpoint correctly returns a default CGI header (see its code under https://kmlexport.toolforge.org/healthz/?source= ). Is the webservice already restarting itself when the probe does not return 200? Or is that a patch for another day?

If not passed, a simple TCP check will be used instead.

Does this mean the webservice restarts itself even if no --health-check-path is given (by pinging the host or something)?

With the health check path set, it already restarts when the health path returns anything other than 200, so yes, that feature is already there :)

For the TCP probes (the default when no health check path is passed), it just tries to open a connection on the port but sends nothing, so it will restart the container if lighttpd is really misbehaving.
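
For reference, the probe activity can be watched from the tool account with kubectl, the same way the Unhealthy event above was captured (the pod name below is a placeholder):

# Probe failures show up as Unhealthy warnings in the Events section.
kubectl get pods
kubectl describe pod kmlexport-abc123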

I see, thank you for the explanation. So if the /healthz endpoint is set up correctly and the probe is pointed at that endpoint using the parameter, this task can be considered solved, right? Since the periodic health check happens inside the webservice, there should be no need for any health-check script running periodically as a grid cron job, correct?

Yes, correct. You don't need any extra job or similar; Kubernetes does the check itself :)

Dvorapa claimed this task.