`qsub sync -y` jobs failing on Grid Engine with "range_list containes no elements" error
Open, High, Public

Description

The grid jobs submitted by Wsexport are failing with the error "Unable to initialize environment because of error: range_list containes no elements Exiting."

The Grid submission command is "jsub -mem 2g -l release=trusty -sync y xvfb-run -a ebook-convert > /dev/null"

Tpt created this task. Oct 23 2017, 1:52 PM
Restricted Application added a subscriber: Aklapper. Oct 23 2017, 1:52 PM
Tpt triaged this task as High priority. Oct 23 2017, 1:52 PM

What is throwing that error message: Jsub or xvfb-run? Can you capture more logs from xvfb-run? When did this start? Is this 100% of jobs or less?

Tpt added a comment. Oct 23 2017, 2:04 PM

What is throwing that error message: Jsub or xvfb-run?

It's jsub. It seems to be a fairly well-known error, according to a quick search.

Can you capture more logs from xvfb-run?

It seems to be the only output. See /data/project/wsexport/xvfb-run.err for the full log.

When did this start?

I am not sure. Probably a few days ago. (I have 34k instances of this error in the logs).

Is this 100% of jobs or less?

It's 100% of jobs.

chasemp added a subscriber: bd808. Oct 23 2017, 3:01 PM

Something there had gone horribly wrong w/ 43 copies of the webservice seemingly running. I stopped it, did some poking at the grid itself to see if things were sane, and then @bd808 restarted. I loaded up https://tools.wmflabs.org/wsexport/tool/book.php and made an epub of https://en.wikisource.org/wiki/A_Princess_of_Mars. I'm not sure what triggered that at the moment but I think we are back in working territory.

Tpt added a comment. Oct 23 2017, 3:41 PM

Thank you for having a look at it. Sadly it is still not working.

The epub export does not use the grid. Only the export to pdf/mobi/txt does: it first generates the epub and then converts it with calibre in a grid job. I ran a few tests and I still get the same error in the logs.

bd808 added a comment. Oct 23 2017, 3:59 PM

https://stackoverflow.com/questions/4883056/sge-qsub-fails-to-submit-jobs-in-sync-mode seems to indicate that this error is related to qsub -sync y jobs and the qmaster running out of space to track them.
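If the limit is only hit transiently, a caller could recognise this specific error and retry instead of failing outright. The following is a minimal sketch, not what Wsexport does: the function names `is_sync_limit_error` and `retry_on_sync_limit` are hypothetical, and it assumes the submission command exits non-zero and prints the message quoted in the description.

```shell
# Return 0 if the given text contains the error quoted in the task description.
is_sync_limit_error() {
    case "$1" in
        *"range_list containes no elements"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: retry_on_sync_limit ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND, retrying only when it fails with the error above.
# Any other failure (or success) is returned immediately.
retry_on_sync_limit() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while :; do
        # Capture stdout and stderr together so the error text can be inspected.
        output=$("$@" 2>&1) && status=0 || status=$?
        if [ "$status" -eq 0 ] || ! is_sync_limit_error "$output"; then
            printf '%s\n' "$output"
            return "$status"
        fi
        if [ "$i" -ge "$attempts" ]; then
            # Out of attempts: report the last error and give up.
            printf '%s\n' "$output"
            return "$status"
        fi
        i=$((i + 1))
        sleep "$delay"
    done
}
```

The jsub command from the description could be passed as COMMAND, for example with 3 attempts and a 30 second delay. Retrying does not help if the qmaster stays saturated, which the 100% failure rate reported above suggests was the case here.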

Is this a general exec host pool resource issue?

bd808 added a comment. Oct 23 2017, 6:55 PM

Is this a general exec host pool resource issue?

It could be, yes. I'm sure we haven't changed any config around this so I'm wondering if there are just some other tools that started using this same constrained pool of resources.

bd808 renamed this task from [Wsexport] Grid job submission failing to `qsub sync -y` jobs failing on Grid Engine with "range_list containes no elements" error. Nov 20 2017, 3:35 PM
bd808 updated the task description.

setup-tomcat is an easy way to recreate this error for debugging per T180830 and T180831.

I tried WsExport with the mobi format several times over the last few weeks. Most of the time it gave a "Bad gateway" error, but on a few occasions it worked correctly. I don't know if this helps.

cscott added a subscriber: cscott. Jan 26 2018, 6:59 AM

Mumble mumble OCG mumble mumble.

For some reason it started working:

tools.zhuyifei1999-test@tools-bastion-05:~$ setup-tomcat
Setting up your public_tomcat directory...
All done.
You can edit the configuration in /data/project/zhuyifei1999-test/public_tomcat/conf/server.xml as needed.

Chicocvenancio added a subscriber: Chicocvenancio. Edited Jan 29 2018, 7:24 PM

Checking the number of currently running dynamic event clients yields 0. This explains why it is working right now.
Per https://stackoverflow.com/a/8234227/3930971 and https://github.com/PacificBiosciences/SMRT-Analysis/wiki/When-using-the-sync-option-in-SGE,-you-may-see-errors, the maximum number of -sync y jobs in our current configuration is 99.
I am not sure whether the errors occurred during a particularly busy time for -sync y jobs or whether a bug in some tool, or in Wsexport itself, artificially populated the queue. If the former, we should investigate raising the limit for sync jobs. Otherwise, we should investigate the queue if the errors start happening again.
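
For checking this again if the errors return, a minimal sketch follows. It assumes that `qconf -secl` prints one row per event client with a numeric ID in the first column, so the count includes static clients such as the scheduler as well as the dynamic ones created by -sync y jobs. The function names are hypothetical, and the default limit of 99 is the figure from the links above (which attribute it to MAX_DYN_EC in qmaster_params).

```shell
# Count event-client rows in `qconf -secl`-style output read from stdin.
# Assumption: each client is one row whose first field is a numeric ID;
# header and separator lines are skipped because they do not match.
count_event_clients() {
    awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Usage: check_sync_headroom COUNT [LIMIT]
# Reports whether COUNT has reached LIMIT (default 99) and returns 1 if so.
check_sync_headroom() {
    count=$1
    limit=${2:-99}
    if [ "$count" -ge "$limit" ]; then
        echo "at limit: $count/$limit event clients"
        return 1
    fi
    echo "ok: $count/$limit event clients"
}
```

On a grid host this could be run as `check_sync_headroom "$(qconf -secl | count_event_clients)"`. Logging the result periodically would show whether the count climbs steadily (suggesting leaked clients from one tool) or spikes only at busy times.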