
`qsub sync -y` jobs failing on Grid Engine with "range_list containes no elements" error
Closed, Invalid (Public)

Description

The grid jobs submitted by Wsexport are failing with the error "Unable to initialize environment because of error: range_list containes no elements Exiting."

The Grid submission command is "jsub -mem 2g -l release=trusty -sync y xvfb-run -a ebook-convert > /dev/null"

Event Timeline

Tpt triaged this task as High priority. Oct 23 2017, 1:52 PM

What is throwing that error message: Jsub or xvfb-run? Can you capture more logs from xvfb-run? When did this start? Is this 100% of jobs or less?

What is throwing that error message: Jsub or xvfb-run?

It's jsub. It seems to be a fairly well-known error, according to a quick web search.

Can you capture more logs from xvfb-run?

It seems to be the only output. See /data/project/wsexport/xvfb-run.err for the full log.

When did this start?

I am not sure. Probably a few days ago. (I have 34k instances of this error in the logs).

Is this 100% of jobs or less?

It's 100% of jobs.
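
To quantify how often the failure shows up, the log can be scanned for the error text. This is only a minimal sketch, assuming the log is plain text with one message per line; the function name `count_sync_errors` is hypothetical and not part of Wsexport.

```python
from typing import Iterable

# Error text as emitted by the grid; the misspelling is in the original message.
ERROR_TEXT = "range_list containes no elements"


def count_sync_errors(lines: Iterable[str]) -> int:
    """Return how many lines contain the Grid Engine error text."""
    return sum(1 for line in lines if ERROR_TEXT in line)
```

It could be applied to the log mentioned above with something like `count_sync_errors(open("/data/project/wsexport/xvfb-run.err", errors="replace"))`.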

Something there had gone horribly wrong w/ 43 copies of the webservice seemingly running. I stopped it, did some poking at the grid itself to see if things were sane, and then @bd808 restarted. I loaded up https://tools.wmflabs.org/wsexport/tool/book.php and made an epub of https://en.wikisource.org/wiki/A_Princess_of_Mars. I'm not sure what triggered that at the moment but I think we are back in working territory.

Thank you for having a look at it. Sadly it is still not working.

The epub export does not use the grid. Only the export to pdf/mobi/txt does: it first generates the epub and then converts it using calibre. I ran a few tests and I still get the same error in the logs.

https://stackoverflow.com/questions/4883056/sge-qsub-fails-to-submit-jobs-in-sync-mode seems to indicate that this error is related to `qsub -sync y` jobs and the qmaster running out of space to track them.

Is this a general exec host pool resource issue?

Is this a general exec host pool resource issue?

It could be, yes. I'm sure we haven't changed any config around this so I'm wondering if there are just some other tools that started using this same constrained pool of resources.

bd808 renamed this task from [Wsexport] Grid job submission failing to `qsub sync -y` jobs failing on Grid Engine with "range_list containes no elements" error. Nov 20 2017, 3:35 PM
bd808 updated the task description.

Running setup-tomcat is an easy way to reproduce this error for debugging, per T180830 and T180831.

I tried WsExport with the mobi format several times over the last few weeks. Most of the time it gave a "Bad gateway" error, but on a few occasions it worked correctly. I don't know if this helps.

For some reason it started working:

tools.zhuyifei1999-test@tools-bastion-05:~$ setup-tomcat
Setting up your public_tomcat directory...
All done.
You can edit the configuration in /data/project/zhuyifei1999-test/public_tomcat/conf/server.xml as needed.

Checking the number of currently running dynamic event clients yields 0. This explains why it is working right now.
Per https://stackoverflow.com/a/8234227/3930971 and https://github.com/PacificBiosciences/SMRT-Analysis/wiki/When-using-the-sync-option-in-SGE,-you-may-see-errors, the maximum number of -sync y jobs in our current configuration is 99.
I am not sure whether the errors occurred during a particularly busy period for -sync y jobs or whether an error in some tool, or in Wsexport itself, artificially populated the queue. If the former, we should investigate raising the limit for sync jobs; otherwise, we should investigate the queue if the errors start happening again.
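
The check described above can be sketched roughly as follows. This assumes that `qconf -secl` prints event clients as an ID/NAME/HOST table and that clients created by `-sync y` submissions are listed under the name `qsub`; both points are assumptions that have not been verified against this grid, and the helper names are hypothetical. The limit of 99 is the figure quoted above.

```python
from typing import Iterable

MAX_SYNC_JOBS = 99  # limit quoted in the comment above


def count_dynamic_clients(secl_lines: Iterable[str], name: str = "qsub") -> int:
    """Count event clients called `name` in `qconf -secl`-style output.

    Assumes rows of the form "<numeric id> <name> <host>". Header and
    separator lines are skipped because their first field is not numeric.
    """
    count = 0
    for line in secl_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] == name:
            count += 1
    return count


def slots_left(secl_lines: Iterable[str], limit: int = MAX_SYNC_JOBS) -> int:
    """Return how many more sync submissions fit under `limit` (never negative)."""
    return max(0, limit - count_dynamic_clients(secl_lines))
```

On a live grid the input would come from something like `subprocess.run(["qconf", "-secl"], capture_output=True, text=True).stdout.splitlines()`. A result of 0 from `slots_left` would be consistent with new `-sync y` submissions being rejected.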

bd808 claimed this task.

No new reports in a year, and the job grid itself has been completely rebuilt in the interim. Certainly should be reopened if a new reproduction case is found.

The error still appears (502 Bad Gateway) - and several users pointed at that.

I do not see any sign on the job grid that -sync y jobs are not working. Do you have any application log output showing this, or are you just reporting that some books are failing to export?

Just reporting some failing exports, which occur pretty often now. Is there really no solution to this?

@bd808 Indeed, Wsexport has stopped submitting calibre conversions to the grid and now runs them inside the webservice worker. So this task is obsolete.

@Ruthven Wsexport issues are tracked on GitHub for now. See https://github.com/wsexport/tool/issues/145 for the open ticket about this error.