zoomviewer runs IO intensive operations locally on tools-webgrid-lighttpd* hosts
Closed, ResolvedPublic

Description

We have been getting an unusual number of high IO wait warnings on lighttpd hosts, and I poked about to see why tools-webgrid-lighttpd-1416 in particular has been flapping and unresponsive.

tools.zoomviewer seems to be running processes locally on the web host that should (as current best practice) be farmed out to the grid, where we have more resources to handle this.

9346 be/4 tools.zo     45.02 M   1588.00 K  0.00 % 26.91 % vips im_vips2tiff cache/~256x256,pyramid [worker]
9344 be/4 tools.zo     22.51 M    996.00 K  0.00 % 17.41 % vips im_vips2tiff cache/~256x256,pyramid [worker]
9345 be/4 tools.zo     45.06 M      2.74 M  0.00 %  8.17 % vips im_vips2tiff cache/~256x256,pyramid [worker]
9343 be/4 tools.zo     45.06 M      2.72 M  0.00 %  8.14 % vips im_vips2tiff cache/~256x256,pyramid [worke
8171 be/4 tools.zo     29.80 M     56.00 K  0.00 %  3.90 % iipsrv.fcgi

Screen Shot 2018-02-01 at 8.44.35 AM.png (1,228×337 px, 270 KB)

Screen Shot 2018-02-01 at 8.48.17 AM.png (1,215×411 px, 344 KB)

tools.zoomviewer:*:51295:dschwen

@dschwen is there any way you can change this to dispatch onto the grid, perform the IO operations serially, or something similar? At the moment it periodically squeezes out other lighttpd tools wherever it lands.
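
To illustrate the "serially perform IO" option, here is a minimal sketch, assuming the conversions are launched as subprocesses from a single long-running Python process. The names `run_serially` and `_serial_pool` are hypothetical and not part of zoomviewer. It only serializes work inside one process; several FastCGI workers would each need to share a lock or a queue to get the same effect.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# A single worker thread means at most one queued command runs at a time.
_serial_pool = ThreadPoolExecutor(max_workers=1)


def run_serially(argv):
    """Queue argv for execution; return a Future that resolves to its exit code."""
    return _serial_pool.submit(lambda: subprocess.run(argv).returncode)


if __name__ == "__main__":
    # Stand-in commands; a real caller would pass the conversion command line.
    first = run_serially([sys.executable, "-c", "print('tile 1')"])
    second = run_serially([sys.executable, "-c", "print('tile 2')"])
    print(first.result(), second.result())
```

This keeps the work on the web host, so it reduces the IO contention rather than removing it; dispatching to the grid is still the preferred fix.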

Event Timeline

chasemp triaged this task as Medium priority. Feb 1 2018, 3:15 PM

Hello @chasemp, as a matter of fact I can. I wrote code for the panoramic image viewer reprojection that uses the grid, and I should be able to apply the same approach to zoomviewer. I'll work on it over the weekend if that's fine.

Thank you, @dschwen

Just a note for posterity: another reason this model is worse than farming out the processes is that the load is hard to quantify, because SGE only has first-class insight into the original lighttpd process.

I.e., during all of that, qstat only shows:

tools.zoomviewer@tools-bastion-03:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  41190 0.31314 lighttpd-z tools.zoomvi r     01/19/2018 00:55:06 webgrid-lighttpd@tools-webgrid     1

All righty! I deployed a new version that uses jsub to dispatch the processing tasks to the grid. Unfortunately, the -once parameter is still unreliable, so I might have to add my own locking if it turns out to be a problem.
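
For readers wondering what "my own locking" could look like, here is a minimal sketch, not the code that was deployed. It assumes a lock directory shared between the web host and the grid nodes, and that `jsub` accepts a job name via `-N` followed by the command to run. `submit_once`, `release_lock`, and the lock file layout are hypothetical names chosen for illustration. The lock is taken by creating a file with `O_CREAT | O_EXCL`, which fails if the file already exists, and the grid job is expected to call `release_lock` when it finishes.

```python
import os
import subprocess
from pathlib import Path


def lock_path_for(job_name, lock_dir):
    """Return the (hypothetical) lock file location for a job name."""
    return Path(lock_dir) / f"{job_name}.lock"


def try_acquire_lock(lock_path):
    """Atomically create lock_path; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_lock(job_name, lock_dir):
    """Remove the lock; meant to be called by the job once it is done."""
    lock_path_for(job_name, lock_dir).unlink(missing_ok=True)


def submit_once(job_name, command, lock_dir, submit=subprocess.run):
    """Submit command through jsub unless a lock for job_name is held.

    Returns True if the job was submitted and False if it was skipped.
    The jsub argument layout here is an assumption, not verified usage.
    """
    lock_path = lock_path_for(job_name, lock_dir)
    if not try_acquire_lock(lock_path):
        return False
    try:
        submit(["jsub", "-N", job_name, *command], check=True)
    except Exception:
        # Submission failed, so free the lock for a later retry.
        lock_path.unlink(missing_ok=True)
        raise
    return True
```

The `submit` parameter exists so the function can be exercised without a grid. A job that dies without calling `release_lock` would leave a stale lock behind, so a real version would probably also want an age check on the lock file.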

tstarling assigned this task to dschwen.
tstarling subscribed.

Thanks for fixing this 6 years ago