The current state is:
- Last edit from the bot: 9 May 2016, 19:47:50 CEST (17:47 UTC)
- No job is currently running
- Big .err logs
- tools.merlbot used qsub
Related commit:
- labs/toollabs (master, +27 -7): jsub: Add support for qsub args used by tools.merlbot
Related tasks:
- T18660 Database table cleanup (tracking): Open, unassigned
- T87716 Missing rows from categorylinks on production servers (dewiki): Open, unassigned
- T69556 merl tools (tracking): Open, unassigned
- T135006 tools.merlbot stopped working: Resolved, assigned to bd808
After discussion with @valhallasw on IRC, we are not really excited about adding support in the new version of jsub for the qsub array job settings that tools.merlbot uses heavily. The long-term (6-12 month) plan is for SGE to be replaced with Kubernetes for most Tool jobs. The rewrite of jsub was a precursor to reducing, rather than expanding, qsub-specific support in advance of this migration.
tools.merlbot is certainly an SGE/qsub power user compared to the average tool. Our intent is not to block that usage today or to cause undue hardship for tool developers by forcing them to adjust their tools to fit the emerging migration plan (T134846).
I have applied this short-term fix for tools.merlbot:
This should get the jobs for tools.merlbot working again as long as the full path to jsub/jstart/qcronsub is not hard-coded in any of the jobs.
I also set PATH=/data/project/merlbot/bin:$PATH for interactive and cron-managed jobs in the tools.merlbot2 project. That tool account is a member of the tools.merlbot project and executes some of the same qsub scripts.
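The workaround relies on ordinary PATH lookup: a command name without a slash is resolved by searching the PATH directories from left to right, so a script named jsub in the first directory shadows any later one, while a command given as a full path skips the search entirely. A minimal sketch of that behaviour using Python's `shutil.which`, which performs the same kind of left-to-right search. It assumes the fix places replacement jsub/jstart/qcronsub scripts in /data/project/merlbot/bin; the temporary directories are hypothetical stand-ins for that directory and the system bin directory.

```python
import os
import shutil
import stat
import tempfile


def make_fake_command(directory: str, name: str) -> str:
    """Create a trivial executable file standing in for a real command."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def demo() -> dict:
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical stand-ins for /data/project/merlbot/bin and the system bin dir.
        tool_bin = os.path.join(root, "tool_bin")
        system_bin = os.path.join(root, "system_bin")
        os.mkdir(tool_bin)
        os.mkdir(system_bin)
        tool_jsub = make_fake_command(tool_bin, "jsub")
        system_jsub = make_fake_command(system_bin, "jsub")

        # Tool directory first, as with PATH=/data/project/merlbot/bin:$PATH.
        prepended = os.pathsep.join([tool_bin, system_bin])

        return {
            # A bare name picks the first match, i.e. the tool's copy.
            "bare_name_prepended": shutil.which("jsub", path=prepended) == tool_jsub,
            # Without the prepended directory, the system copy is found.
            "bare_name_default": shutil.which("jsub", path=system_bin) == system_jsub,
            # A full path bypasses PATH, so the workaround has no effect on it.
            "full_path_ignores_path": shutil.which(system_jsub, path=prepended) == system_jsub,
        }


if __name__ == "__main__":
    print(demo())
```

This is why the caveat about hard-coded full paths matters: any job that invokes the system jsub by its absolute path never consults PATH.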
I don't think this workaround (or T134846 as the pièce de résistance) is an adequate solution, as there is no "problem" to be solved. jsub offered these possibilities, @Merl made use of them, and the regression introduced by the rewrite can easily be fixed. Why force users to change their behaviour?
The survey found that more than half of users have less than one hour per week to spend on Wikimedia Tool Labs. It seems entirely unnecessary to me to cause them grief just because there may be a migration to another backend in a year.
And the logic seems flawed to me as well: there will (very probably) not be a 1:1 substitute for the backend in a year's time. Some jsub/qsub options make no sense for Kubernetes, while others may be needed for it. So why break every invocation that does not work everywhere? Instead, if jsub is called with an option that has no meaning in Kubernetes, it can simply fail when the Kubernetes backend is selected, and do whatever makes sense when another backend is selected.
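The proposal amounts to validating options per backend instead of removing them globally. A minimal sketch of that idea, assuming a wrapper that knows which backend it will submit to: the `BACKEND_OPTIONS` table, the `check_options` helper and the backend names are hypothetical, and the option sets are placeholders that do not describe jsub's real code or what a Kubernetes backend would actually support.

```python
from __future__ import annotations

# Hypothetical capability table; the entries are placeholders for illustration only.
BACKEND_OPTIONS: dict[str, set[str]] = {
    "gridengine": {"-continuous", "-mem", "-once", "-quiet", "-umask"},
    "kubernetes": {"-continuous", "-mem", "-once", "-quiet"},
}


class UnsupportedOption(Exception):
    """Raised when the selected backend cannot honour a requested option."""


def check_options(backend: str, options: list[str]) -> None:
    """Fail only if the *selected* backend lacks support for an option."""
    supported = BACKEND_OPTIONS[backend]
    unsupported = [opt for opt in options if opt not in supported]
    if unsupported:
        raise UnsupportedOption(
            f"{backend} backend does not support: {', '.join(unsupported)}"
        )


if __name__ == "__main__":
    check_options("gridengine", ["-once", "-umask"])  # accepted
    try:
        check_options("kubernetes", ["-once", "-umask"])
    except UnsupportedOption as err:
        print(err)
```

The same invocation is accepted by one backend and rejected by another, so existing grid jobs keep working while the restriction applies only where the option is meaningless.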
AFAIUI, our Kubernetes setup is currently "experimental", and users cannot control their containers on their own. I'm sure it will be great when it is finished, and it will certainly offer features that SGE does not (and cannot) provide, but breaking a working tool and being righteous about it is not the way to go IMNSHO.
Hi, sorry, I am inactive at the moment and will not be able to access my tools for the next few weeks.
My mailbox is full of failed cron job reports because of:
/bin/sh: 1: date: not found
/bin/sh: 1: jlocal: not found
@bd808, can you please correct the PATH in the tools.merlbot and tools.merlbot2 crontabs so that these commands are found again? I hope that gets all my bot scripts running again.
And sorry that I am relying on the advanced features of SGE. I have no experience with Kubernetes, so please keep in mind that I will need help migrating my scripts to Kubernetes. Features I need, such as dependency trees and limits on the number of parallel executions, are all currently handled by SGE.
My use of PATH=...:$PATH inside the crontab file was incorrect. I have now set PATH explicitly to /data/project/merlbot/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin.
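The original line failed because cron does not expand variables in crontab environment settings: the crontab(5) manual for Vixie-derived cron daemons warns that a line like `PATH = $HOME/bin:$PATH` will not work as expected. `$PATH` therefore stayed a literal directory name and /bin dropped out of the search path, which matches the `date: not found` errors. A small model of the difference; `cron_env_value` is a hypothetical simplification of cron's parser (real cron also handles quotes), and `DEFAULT_PATH` assumes the `/usr/bin:/bin` default that Debian's cron documents.

```python
from string import Template

# Assumed default PATH that cron gives jobs before any crontab assignment.
DEFAULT_PATH = "/usr/bin:/bin"


def cron_env_value(line: str) -> str:
    """Simplified model of a crontab 'NAME=value' line: the value is kept verbatim."""
    _, _, value = line.partition("=")
    return value.strip()


def shell_env_value(line: str, inherited_path: str) -> str:
    """What a shell would produce for the same assignment: $PATH is expanded."""
    return Template(cron_env_value(line)).substitute(PATH=inherited_path)


if __name__ == "__main__":
    broken = "PATH=/data/project/merlbot/bin:$PATH"
    fixed = ("PATH=/data/project/merlbot/bin:/usr/local/sbin:/usr/local/bin"
             ":/usr/sbin:/usr/bin:/sbin:/bin")
    print(cron_env_value(broken).split(":"))                 # ['/data/project/merlbot/bin', '$PATH']
    print(shell_env_value(broken, DEFAULT_PATH).split(":"))  # ['/data/project/merlbot/bin', '/usr/bin', '/bin']
    print("/bin" in cron_env_value(fixed).split(":"))        # True
```

Spelling out every directory, as in the corrected crontab, avoids depending on expansion at all.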
> And sorry that I am relying on the advanced features of SGE. I have no experience with Kubernetes, so please keep in mind that I will need help migrating my scripts to Kubernetes.
No need to apologize for making full use of the features of SGE. The longer-term issue that the Tool Labs community faces is finding a migration path for users of non-trivial SGE functionality to other systems. This need for migration is driven by the fact that SGE was dropped as a package from Debian Jessie, which makes long-term support of the existing SGE-based platform for Tool Labs questionable. Today we simply do not have enough system administration resources to commit to supporting in-house-maintained packages or ports from another distribution.
The 80% use case for Tool Labs tools can easily be covered by other job management systems, as most jobs only use basic features such as automatic execution host selection, duplicate job prevention (which, as far as I know, is actually something jsub adds on top of SGE), and restart on job failure (again a jsub addition layered on top of SGE).
> Features I need, such as dependency trees and limits on the number of parallel executions, are all currently handled by SGE.
These requirements fall outside the basic features needed by the majority of jobs. That's not a bad thing, but it does require a more advanced task scheduling system than the typical "run this script forever" and "run this script once" use cases. If I knew of an easy solution for these extra requirements today, I would certainly describe it here.
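To make the two extra requirements concrete, here is a local sketch of a scheduler that starts each job only after all of its dependencies have finished and never runs more than `max_parallel` jobs at once. `run_dag` is a hypothetical helper built on Python's `concurrent.futures`; it is not an SGE or Kubernetes feature and not a proposed solution for Tool Labs, only an illustration of what such a system has to provide.

```python
from __future__ import annotations

from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Callable


def run_dag(
    jobs: dict[str, Callable[[], None]],
    deps: dict[str, set[str]],
    max_parallel: int,
) -> list[str]:
    """Run every job after all of its dependencies have finished, keeping at
    most `max_parallel` jobs running at once. Returns the completion order."""
    # Outstanding dependencies for each job that has not been started yet.
    waiting = {name: set(deps.get(name, ())) for name in jobs}
    running: dict[Future, str] = {}
    finished_order: list[str] = []
    # The pool size is what enforces the parallel execution limit.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        while waiting or running:
            # Submit every job whose dependencies are all satisfied.
            for name in [n for n, pending in waiting.items() if not pending]:
                del waiting[name]
                running[pool.submit(jobs[name])] = name
            if not running:
                # Nothing can start and nothing is running: cycle or unknown dependency.
                raise ValueError(f"unsatisfiable dependencies: {sorted(waiting)}")
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # propagate a job's exception, if any
                finished_order.append(name)
                # Release jobs that were waiting on this one.
                for pending in waiting.values():
                    pending.discard(name)
    return finished_order


if __name__ == "__main__":
    noop = lambda: None
    print(run_dag({"fetch": noop, "parse": noop, "report": noop},
                  {"parse": {"fetch"}, "report": {"parse"}}, max_parallel=2))
```

A real cluster scheduler must enforce the same two rules across hosts and keep state between runs, which is where the difficulty lies.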
There is no short-term plan to remove SGE entirely. In fact, as @scfc points out, the Kubernetes migration is very much a work in progress in the early stages of rollout. The band-aid I applied to your jobs was meant to restore them quickly, not to be a well-designed long-term solution. We need to work together to find a longer-term solution that lets you continue your work in Tool Labs without blocking migration of the 80% use case to a competing platform that is better supported upstream and open to other workflow and usability changes.
The mid-term solution will likely be to recommend that advanced jobs use qsub directly rather than relying on the jsub family of wrapper scripts. To do that, we need to find a replacement for any jsub-specific functionality in use (-continuous, -mem, -once, -quiet, -umask). The use of -once is what kept me from making that change to your crontabs yesterday.
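One possible local stand-in for the duplicate prevention of -once and the restart on failure of -continuous, sketched under stated assumptions: duplicates are detected with an exclusive, non-blocking `flock` on a per-job lock file, and a failed command is re-run while it exits non-zero, up to a cap. `run_once_continuous`, the lock-file naming and the restart cap are hypothetical; this is not how jsub implements these flags, which it layers on top of SGE.

```python
from __future__ import annotations

import fcntl
import os
import subprocess
import time


def run_once_continuous(name: str, argv: list[str], max_restarts: int,
                        lock_dir: str) -> int | None:
    """Hypothetical stand-in for jsub's -once and -continuous.

    -once:       refuse to start if another process holds the lock for `name`.
    -continuous: re-run the command while it exits non-zero, at most
                 `max_restarts` extra times.
    Returns the final exit status, or None if a duplicate was detected.
    """
    lock_path = os.path.join(lock_dir, f"{name}.lock")
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock; fails if another instance holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        status = subprocess.run(argv).returncode
        restarts = 0
        while status != 0 and restarts < max_restarts:
            restarts += 1
            time.sleep(0.1)  # brief back-off before restarting
            status = subprocess.run(argv).returncode
        return status  # the lock is released when the file is closed
```

A file lock only guards against duplicates among processes that share the lock file on one host, which is a much weaker guarantee than a cluster-wide scheduler provides; that gap is exactly why -once is hard to replace.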