`qsub sync -y` jobs failing on Grid Engine with "range_list containes no elements" error
Open, High, Public

Description

The grid jobs submitted by Wsexport are failing with the error "Unable to initialize environment because of error: range_list containes no elements Exiting."

The Grid submission command is "jsub -mem 2g -l release=trusty -sync y xvfb-run -a ebook-convert > /dev/null"

Tpt created this task. Oct 23 2017, 1:52 PM
Restricted Application added a subscriber: Aklapper. Oct 23 2017, 1:52 PM
Tpt triaged this task as High priority. Oct 23 2017, 1:52 PM

What is throwing that error message: Jsub or xvfb-run? Can you capture more logs from xvfb-run? When did this start? Is this 100% of jobs or less?

Tpt added a comment. Oct 23 2017, 2:04 PM

What is throwing that error message: Jsub or xvfb-run?

It's jsub. It seems to be a fairly well-known error, according to a quick Google search.

Can you capture more logs from xvfb-run?

It seems to be the only output. See /data/project/wsexport/xvfb-run.err for the full log.

When did this start?

I am not sure. Probably a few days ago. (I have 34k instances of this error in the logs).

Is this 100% of jobs or less?

It's 100% of jobs.
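The occurrence count quoted above can be reproduced with a plain text search over the log. The following is a minimal sketch, assuming the log is a text file in which each failure produces one line containing the error message; the function name `count_range_list_errors` is hypothetical, and the default path is the log file mentioned in this comment.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that contain the Grid Engine error
# quoted in this task. The spelling "containes" matches the message as logged.
count_range_list_errors() {
    # Default to the log path mentioned in the thread; pass a path to override.
    log_file="${1:-/data/project/wsexport/xvfb-run.err}"
    # -F treats the pattern as a literal string; -c prints the number of matching lines.
    grep -c -F 'range_list containes no elements' "$log_file"
}
```

Calling `count_range_list_errors` with no argument reads the default log, and `count_range_list_errors /path/to/other.err` reads another file. Note that `grep -c` counts matching lines, not individual matches within a line.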

chasemp added a subscriber: bd808. Oct 23 2017, 3:01 PM

Something there had gone horribly wrong, with 43 copies of the webservice seemingly running. I stopped it, did some poking at the grid itself to see if things were sane, and then @bd808 restarted it. I loaded up https://tools.wmflabs.org/wsexport/tool/book.php and made an epub of https://en.wikisource.org/wiki/A_Princess_of_Mars. I'm not sure what triggered that at the moment, but I think we are back in working territory.

Tpt added a comment. Oct 23 2017, 3:41 PM

Thank you for having a look at it. Sadly it is still not working.

The epub export does not use the grid. It is the export to pdf/mobi/txt that does: it first generates the epub and then converts it using calibre. I did a few tests and I still get the same error in the logs.

bd808 added a comment. Oct 23 2017, 3:59 PM

https://stackoverflow.com/questions/4883056/sge-qsub-fails-to-submit-jobs-in-sync-mode seems to indicate that this error is related to `qsub -sync y` jobs and the qmaster running out of space to track them.

Is this a general exec host pool resource issue?
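If the failure comes from the qmaster temporarily being unable to track another sync-mode client, as the linked question suggests, a caller could retry the submission after a pause. The following is only a sketch under that assumption: `submit_with_retry`, `RETRY_MAX`, and `RETRY_DELAY` are hypothetical names, not part of jsub or Grid Engine, and the code assumes the submission command exits non-zero when it prints the error. It would not help with a persistent failure such as the 100% rate reported above.

```shell
#!/bin/sh
# Hypothetical wrapper: run a submission command and retry it while it fails
# with the "range_list containes no elements" message.
# RETRY_MAX and RETRY_DELAY are illustrative knobs, not Grid Engine settings.
submit_with_retry() {
    max_attempts="${RETRY_MAX:-5}"
    delay="${RETRY_DELAY:-10}"
    attempt=1
    while :; do
        # Capture stdout and stderr together so the error text can be inspected.
        output=$("$@" 2>&1) && status=0 || status=$?
        if [ -n "$output" ]; then
            printf '%s\n' "$output"
        fi
        # Success: nothing more to do.
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        # Any failure other than the known error is returned immediately.
        case "$output" in
            *'range_list containes no elements'*) ;;
            *) return "$status" ;;
        esac
        # Give up once the attempt budget is spent.
        if [ "$attempt" -ge "$max_attempts" ]; then
            return "$status"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

Under these assumptions, the command from the task description would be invoked as `submit_with_retry jsub -mem 2g -l release=trusty -sync y xvfb-run -a ebook-convert` followed by its usual arguments. Because the wrapper captures output to inspect it, the output is printed only after each attempt finishes.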

bd808 added a comment. Oct 23 2017, 6:55 PM

Is this a general exec host pool resource issue?

It could be, yes. I'm sure we haven't changed any config around this so I'm wondering if there are just some other tools that started using this same constrained pool of resources.

bd808 renamed this task from [Wsexport] Grid job submission failing to `qsub sync -y` jobs failing on Grid Engine with "range_list containes no elements" error. Mon, Nov 20, 3:35 PM
bd808 updated the task description.

tomcat-setup is an easy way to recreate this error for debugging per T180830 and T180831.