
tools.merlbot stopped working
Closed, Resolved · Public

Description

The current state is:

  • Last edit from the bot: 9 May 2016, 19:47:50 CEST (17:47 UTC)
  • No job is currently running
  • Big .err logs
  • tools.merlbot used qsub

Event Timeline

Restricted Application added a project: Cloud-Services. May 11 2016, 3:45 PM
Restricted Application added subscribers: Zppix, Aklapper.

MerlBot is the most active bot at dewiki, so a quick solution would be greatly appreciated by the community there.

bd808 claimed this task. May 11 2016, 4:04 PM

This looks to be fallout from the rewrite of jsub. The cronjobs run by tools.merlbot use several qsub options that are not currently supported by the jsub runner script.

Restricted Application added a project: User-bd808. May 11 2016, 4:04 PM

Change 288226 had a related patch set uploaded (by BryanDavis):
jsub: Add support for qsub args used by tools.merlbot

https://gerrit.wikimedia.org/r/288226

After discussion with @valhallasw on IRC, we are not really excited about adding support in the new version of jsub for the qsub array job settings that tools.merlbot uses heavily. The long-term (6-12 month) plan is for SGE to be replaced with Kubernetes for most Tool jobs. The rewrite of jsub was a precursor to reducing, rather than expanding, qsub-specific support in advance of this migration.

tools.merlbot is certainly an SGE/qsub power user compared to the average tool. Our intent is not to block that usage today or to cause undue hardship for tool developers by making them adjust their tools to fit the emerging migration plan (T134846).

I have applied this short term fix for tools.merlbot:

  • Created ~/bin/legacy-jsub.pl with the version of jsub from rLTOL3d65ea76537d (the last change to the perl script)
  • Added ~/bin/jstart, ~/bin/jsub, and ~/bin/qcronsub symlinks to ~/bin/legacy-jsub.pl
  • Created ~/.profile with PATH=$HOME/bin:$PATH to make these local versions the default for interactive shells.
  • Added PATH=/data/project/merlbot/bin:$PATH in crontab to make these local versions the default for cron processes.

This should get the jobs for tools.merlbot working again as long as the full path for jsub/jstart/qcronsub is not hard coded in any of the jobs.
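The steps above can be sketched roughly as follows. This is only an illustration that runs against a scratch directory: `TOOL_HOME` and the contents of the stub script are assumptions made for the example, and the stub is a small shell script standing in for the real perl script, which is not reproduced here.

```shell
#!/bin/sh
# Sketch only: TOOL_HOME is a scratch directory standing in for the
# tool's home directory.
TOOL_HOME=$(mktemp -d)
mkdir -p "$TOOL_HOME/bin"

# Hypothetical stand-in for the legacy jsub script. It only reports the
# name it was invoked by.
cat > "$TOOL_HOME/bin/legacy-jsub.pl" <<'EOF'
#!/bin/sh
echo "legacy wrapper invoked as $(basename "$0")"
EOF
chmod +x "$TOOL_HOME/bin/legacy-jsub.pl"

# One symlink per command name, all pointing at the same script.
for name in jstart jsub qcronsub; do
    ln -s legacy-jsub.pl "$TOOL_HOME/bin/$name"
done

# Putting the local bin directory first makes these versions win the
# command lookup, as long as callers do not hard code a full path.
PATH="$TOOL_HOME/bin:$PATH"
export PATH

resolved=$(command -v jsub)
jsub_output=$(jsub)
echo "$resolved"
echo "$jsub_output"
```

A job that calls `/usr/bin/jsub` (or any other absolute path) bypasses the lookup entirely, which is why the caveat about hard-coded paths matters.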

bd808 added a comment. May 11 2016, 8:11 PM

I set PATH=/data/project/merlbot/bin:$PATH for interactive and cron managed jobs in the tools.merlbot2 project as well. That tool account is a member of the tools.merlbot project and executes some of the same qsub scripts.

bd808 moved this task from Backlog to Doing on the Community-Tech-Tool-Labs board.
scfc added a subscriber: scfc. May 11 2016, 11:24 PM

I don't think this work-around (or T134846 as the pièce de résistance) is an adequate solution, as there is no "problem" to be solved. jsub offered these possibilities, @Merl made use of them, and the rewrite's regression can be easily fixed. Why force users to change their behaviour?

The survey found that more than half of the users have less than one hour per week to spend on Wikimedia Tool Labs. It feels absolutely unnecessary to me to cause grievances because in a year there may be a migration to another backend.

And the logic seems flawed to me as well: There will (very probably) not be a 1:1 substitute for the backend (in a year's time). Some options for jsub/qsub make no sense for Kubernetes, while others may be needed for that. So why wreck all invocations that do not work everywhere? Instead, when jsub is called with an option that has no meaning in Kubernetes, it can simply fail if the Kubernetes backend is chosen and do whatever makes sense if another backend is chosen.

AFAIUI, currently our Kubernetes setup is "experimental", with users not being able to control their containers on their own. I'm sure it will be great when it is finished, and it will certainly offer features that SGE does not (and cannot) provide, but breaking a working tool and being righteous about it is not the way to go IMNSHO.

Merl added a comment. May 12 2016, 10:46 AM

Hi, sorry, I am inactive at the moment and will not be able to access my tools in the next few weeks.

My mailbox is full of failed cron job reports because of:
/bin/sh: 1: date: not found
/bin/sh: 1: jlocal: not found

@bd808 can you please correct the path in the tools.merlbot and tools.merlbot2 crontabs so that these scripts are found again? I hope that helps to get all my bot scripts running again.

And sorry that I am using the advantages of SGE. I have no experience with Kubernetes, so please keep in mind that I will need help migrating my scripts to Kubernetes. Needed features like dependency trees, limiting the parallel execution count, and so on are currently all handled by SGE.

bd808 added a comment. May 12 2016, 3:57 PM

> Hi, sorry, I am inactive at the moment and will not be able to access my tools in the next few weeks.
> My mailbox is full of failed cron job reports because of:
> /bin/sh: 1: date: not found
> /bin/sh: 1: jlocal: not found
> @bd808 can you please correct the path in the tools.merlbot and tools.merlbot2 crontabs so that these scripts are found again? I hope that helps to get all my bot scripts running again.

My use of PATH=...:$PATH inside the crontab file was incorrect. I have now set PATH explicitly to /data/project/merlbot/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin.
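For context, cron does not perform shell expansion on variable assignments in a crontab, so the `$PATH` in such a line is kept as a literal string. The effect can be reproduced outside cron with a sketch like the one below. The variable names are made up for the example, and the exact behaviour depends on which shell is installed as `/bin/sh` (a shell with a built-in `date` would not show the failure).

```shell
#!/bin/sh
# Simulate the environment cron builds from the crontab line
#   PATH=/data/project/merlbot/bin:$PATH
# The single quotes keep $PATH unexpanded, just as cron does.
literal_path='/data/project/merlbot/bin:$PATH'

if env -i PATH="$literal_path" /bin/sh -c 'command -v date' >/dev/null 2>&1; then
    literal_result=found
else
    literal_result=missing
fi

# The corrected crontab lists every directory explicitly.
explicit_path='/data/project/merlbot/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'

if env -i PATH="$explicit_path" /bin/sh -c 'command -v date' >/dev/null 2>&1; then
    explicit_result=found
else
    explicit_result=missing
fi

echo "literal \$PATH entry: date is $literal_result"
echo "explicit directory list: date is $explicit_result"
```

With the literal value, the search path consists of the tool's bin directory and a relative directory that is literally named `$PATH`, so standard utilities such as `date` cannot be found.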

> And sorry that I am using the advantages of SGE. I have no experience with Kubernetes, so please keep in mind that I will need help migrating my scripts to Kubernetes.

No need to apologize for making full use of the features of SGE. The longer term issue that the Tool Labs community faces is finding a migration path for users of non-trivial SGE functionality to other systems. This need for migration is driven by the fact that SGE was dropped as a package from Debian Jessie. This makes long term support of the existing SGE-based platform for Tool Labs questionable. Today we just don't have enough system administration resources available to commit to supporting in-house maintained packages or ports from some other distribution.

The 80% use case for Tool Labs tools can be easily covered by other job management systems as most jobs only take advantage of basic features like automatic execution host selection, duplicate job prevention (which is actually something that jsub adds to SGE as far as I know), and restart on job failure (again a jsub invention layered on top of SGE).

> Needed features like dependency trees, limiting the parallel execution count, and so on are currently all handled by SGE.

These requirements of your jobs fall outside the basic features that are needed by the majority of jobs. That's not a bad thing, but it does require a more advanced task scheduling system than the typical "run this script forever" and "run this script once" use cases. If I knew an easy solution for these extra requirements today I would certainly attempt to describe it here.

There is no short-term plan to remove SGE entirely. As @scfc points out, the Kubernetes migration is very much a work in progress at the beginning stages of rollout. The band-aid I made for your jobs was meant to restore them quickly rather than to be a well-designed long-term solution. We need to work together to come up with a longer-term solution that will allow you to continue to do your work in Tool Labs without blocking migration of the 80% use case to a competing platform that is better supported upstream and open to other workflow and usability changes.

The mid-term solution will likely be to recommend migration of advanced jobs to direct use of qsub rather than relying on the jsub family of wrapper scripts. In order to do that we need to find a replacement for any jsub-specific functionality in use (-continuous, -mem, -once, -quiet, -umask). The use of -once was the blocker to me making that change to your crontabs yesterday.
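As a rough illustration of what direct qsub use might look like, the sketch below only prints command lines and submits nothing. The helper `qsub_line`, the job names, the script names, and the limits are all hypothetical. It assumes that a memory request can be expressed with `-l h_vmem`, and it uses the standard Grid Engine options `-N` (job name), `-hold_jid` (wait for another job), `-t` (array job), and `-tc` (maximum concurrently running array tasks). It does not attempt to replace -once or -continuous.

```shell
#!/bin/sh
# Sketch only: builds qsub command lines as text instead of running them.

# qsub_line NAME MEM SCRIPT [DEPENDS_ON]
# Prints a qsub invocation with a job name, a memory request, and an
# optional dependency on another job name.
qsub_line() {
    name=$1
    mem=$2
    script=$3
    depends_on=${4:-}

    line="qsub -N $name -l h_vmem=$mem"
    if [ -n "$depends_on" ]; then
        # Hold this job until the named job has finished.
        line="$line -hold_jid $depends_on"
    fi
    printf '%s %s\n' "$line" "$script"
}

# A two-step dependency chain: "report" waits for "fetch".
fetch_cmd=$(qsub_line fetch 512m fetch.sh)
report_cmd=$(qsub_line report 256m report.sh fetch)

# An array job with 20 tasks, at most 4 of them running at once.
array_cmd='qsub -N pages -t 1-20 -tc 4 pages.sh'

printf '%s\n' "$fetch_cmd" "$report_cmd" "$array_cmd"
```

Dependency chains and concurrency limits map onto plain qsub options in this way; duplicate-job prevention and restart on failure are the parts that would still need a separate replacement.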

bd808 moved this task from To Do to Needs Review/Feedback on the User-bd808 board. May 12 2016, 3:57 PM
bd808 added a comment. May 16 2016, 4:38 AM

@Merl, @Luke081515 Is the bot functioning again?

Luke081515 closed this task as Resolved. May 16 2016, 12:29 PM

Yep. Thank you very much for that fix :)

bd808 moved this task from Doing to Done on the Community-Tech-Tool-Labs board. May 16 2016, 3:29 PM

Change 288226 abandoned by BryanDavis:
jsub: Add support for qsub args used by tools.merlbot

Reason:
Merlbot seems to be the only user of these additional options.

https://gerrit.wikimedia.org/r/288226