`qsub sync -y` jobs failing on Grid Engine with "range_list containes no elements" error
Closed, Invalid · Public

Description

The grid jobs submitted by Wsexport are failing with the error "Unable to initialize environment because of error: range_list containes no elements Exiting."

The Grid submission command is "jsub -mem 2g -l release=trusty -sync y xvfb-run -a ebook-convert > /dev/null"

Event Timeline

Tpt created this task.Oct 23 2017, 1:52 PM
Restricted Application added a subscriber: Aklapper.Oct 23 2017, 1:52 PM
Tpt triaged this task as High priority.Oct 23 2017, 1:52 PM

What is throwing that error message: Jsub or xvfb-run? Can you capture more logs from xvfb-run? When did this start? Is this 100% of jobs or less?

Tpt added a comment.Oct 23 2017, 2:04 PM

What is throwing that error message: Jsub or xvfb-run?

It's jsub. It seems to be a fairly well-known error according to a quick search.

Can you capture more logs from xvfb-run?

It seems to be the only output. See /data/project/wsexport/xvfb-run.err for the full log.

When did this start?

I am not sure. Probably a few days ago. (I have 34k instances of this error in the logs).

Is this 100% of jobs or less?

It's 100% of jobs.

chasemp added a subscriber: bd808.Oct 23 2017, 3:01 PM

Something there had gone horribly wrong w/ 43 copies of the webservice seemingly running. I stopped it, did some poking at the grid itself to see if things were sane, and then @bd808 restarted. I loaded up https://tools.wmflabs.org/wsexport/tool/book.php and made an epub of https://en.wikisource.org/wiki/A_Princess_of_Mars. I'm not sure what triggered that at the moment but I think we are back in working territory.

Tpt added a comment.Oct 23 2017, 3:41 PM

Thank you for having a look at it. Sadly it is still not working.

The EPUB export does not use the grid. Only the export to PDF/MOBI/TXT does: it first generates the EPUB and then converts it using calibre. I ran a few tests and I still get the same error in the logs.

bd808 added a comment.Oct 23 2017, 3:59 PM

https://stackoverflow.com/questions/4883056/sge-qsub-fails-to-submit-jobs-in-sync-mode seems to indicate that this error is related to `qsub -sync y` jobs and the qmaster running out of space to track them.

Is this a general exec host pool resource issue?

bd808 added a comment.Oct 23 2017, 6:55 PM

Is this a general exec host pool resource issue?

It could be, yes. I'm sure we haven't changed any config around this so I'm wondering if there are just some other tools that started using this same constrained pool of resources.

bd808 renamed this task from [Wsexport] Grid job submission failing to `qsub sync -y` jobs failing on Grid Engine with "range_list containes no elements" error.Nov 20 2017, 3:35 PM
bd808 updated the task description.

setup-tomcat is an easy way to recreate this error for debugging, per T180830 and T180831.

I tried WsExport with the mobi format several times over the last few weeks. Most of the time it gave a "Bad gateway" error, but on a few occasions it worked correctly. I don't know if this helps.

cscott added a subscriber: cscott.Jan 26 2018, 6:59 AM

Mumble mumble OCG mumble mumble.

For some reason it started working:

tools.zhuyifei1999-test@tools-bastion-05:~$ setup-tomcat
Setting up your public_tomcat directory...
All done.
You can edit the configuration in /data/project/zhuyifei1999-test/public_tomcat/conf/server.xml as needed.
Chicocvenancio added a subscriber: Chicocvenancio.EditedJan 29 2018, 7:24 PM

Checking the number of currently running dynamic event clients yields 0. This explains why it is working right now.
Per https://stackoverflow.com/a/8234227/3930971 and https://github.com/PacificBiosciences/SMRT-Analysis/wiki/When-using-the-sync-option-in-SGE,-you-may-see-errors , the maximum number of -sync y jobs in our current configuration is 99.
I am not sure whether the errors occurred during a particularly busy time for -sync y jobs, or whether an error in some tool, or in Wsexport itself, artificially populated the queue. If the former, we should investigate raising the limit for sync jobs; otherwise, we should investigate the queue if the errors start happening again.
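
Whichever cause it turns out to be, a submitting tool could tell this failure apart from an ordinary job failure and back off before resubmitting. Below is a minimal Python sketch of that idea, not Wsexport's actual code: `submit_with_retry`, its parameters, and the injectable `run`/`sleep` hooks are hypothetical names, and the only detail taken from this task is the error text itself.

```python
import subprocess
import time

# Error text reported in this task when a "-sync y" submission is rejected
# (the spelling is as emitted in the logs quoted above).
SYNC_LIMIT_ERROR = "range_list containes no elements"


def is_sync_limit_error(stderr_text):
    """Return True if stderr carries the error described in this task."""
    return SYNC_LIMIT_ERROR in stderr_text


def submit_with_retry(argv, attempts=5, base_delay=1.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run a submission command, retrying only on the sync-limit error.

    `run` and `sleep` are injectable (hypothetical hooks) so the logic
    can be exercised without a grid. The delay doubles after each
    failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        result = run(argv, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if not is_sync_limit_error(result.stderr or ""):
            return result  # unrelated failure: leave it to the caller
        if attempt < attempts:
            sleep(delay)
            delay *= 2
    return result  # still failing after the last attempt
```

Retrying only helps if the limit was hit during a temporary burst; if stale clients are filling the queue, every attempt fails and the last result is returned unchanged.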

Ltrlg added a subscriber: Ltrlg.Mar 25 2019, 2:01 PM
bd808 closed this task as Resolved.Mar 26 2019, 12:05 AM
bd808 claimed this task.

No new reports in a year, and the job grid itself has been completely rebuilt in the interim. Certainly should be reopened if a new reproduction case is found.

bd808 added a comment.Apr 4 2019, 3:51 AM

The error still appears (502 Bad Gateway) - and several users pointed at that.

I do not see any sign on the job grid that -sync y jobs are not working. Do you have any application log output showing this, or are you just reporting that some books are failing to export?

Just reporting some failing exports, which occur pretty often now. Is there really no solution to this?

Tpt closed this task as Invalid.Apr 4 2019, 12:35 PM

@bd808 Indeed, wsexport has stopped submitting calibre conversions to the grid and now runs them inside the webservice worker. So this task is obsolete.
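
For illustration only, running the conversion in the worker rather than on the grid could look roughly like the Python sketch below. It is not taken from Wsexport: the function names and the 300-second timeout are assumptions. It relies only on calibre's `ebook-convert input output` command form, where the output format follows from the output file's extension, and on the `xvfb-run -a` wrapper already shown in the description.

```python
import subprocess
from pathlib import Path


def build_convert_argv(epub_path, target_format, use_xvfb=True):
    """Build the argv for converting an EPUB with calibre's ebook-convert.

    ebook-convert infers the output format from the output file's
    extension, so the target is the input path with a new suffix.
    """
    source = Path(epub_path)
    target = source.with_suffix("." + target_format.lstrip("."))
    argv = ["ebook-convert", str(source), str(target)]
    if use_xvfb:
        # Same wrapper as in the grid command from the description.
        argv = ["xvfb-run", "-a"] + argv
    return argv, target


def convert(epub_path, target_format, timeout=300):
    """Run the conversion as a local child process, with no grid submission."""
    argv, target = build_convert_argv(epub_path, target_format)
    # check=True raises CalledProcessError if ebook-convert fails.
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL, timeout=timeout)
    return target
```

A call such as `convert("/tmp/book.epub", "mobi")` would write `/tmp/book.mobi` next to the input, without going through `jsub -sync y` at all.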

@Ruthven Wsexport issues are tracked on GitHub for now. See https://github.com/wsexport/tool/issues/145 for the open ticket about this error.

MJL added a subscriber: MJL.Apr 7 2019, 3:36 AM