ZoomViewer produces a 503 error
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

No webservice
The URL you have requested, https://zoomviewer.toolforge.org/index.php?f=Himalaya,_Indian_Atlas,_sheet_66_(15219000).jpg&flash=no, is not currently serviced.

What should have happened instead?:
Image should be displayed.

Other information (browser name/version, screenshots, etc.):
Chrome on Windows 10.

Also on other large files (no thumbnail is produced for these files):

This is particularly an issue, as for these large files, we give a warning: "The original file is very high-resolution. It might not load properly or could cause your browser to freeze when opened at full size. To avoid these issues, use the ZoomViewer."

This also happens on files where there is a thumbnail:
https://commons.wikimedia.org/wiki/File:Map_of_the_Northern_Part_of_the_Punjab_and_of_Kashmir_(13305002).jpg

Event Timeline

My guess is that @dschwen (tool maintainer) will know best how to handle this issue.

Furthermore, it would be great if maintainers added info to https://toolsadmin.wikimedia.org/tools/id/zoomviewer where to report issues. Thanks in advance!

Hi, This is a serious issue, which should be very high priority, as some images can't be displayed at all.

Okay, I started the adoption process:

Now we need to give @dschwen 14 days to object to the adoption. Feel free to ping me if I forget to go forward with the adoption process in 14 days (assuming that @dschwen doesn’t object).

I restarted the webservice after seeing @Multichill ask for someone to check in on it:

#wikimedia-cloud IRC 2023-08-23
[18:36]  <    wm-bb> <MaartenDammers> Can one of the admins give the webservice of https://zoomviewer.toolforge.org/ a nudge?
[19:41]  <    bd808> !log tools.zoomviewer Force killed webservice job stuck in deletion since 2023-06-15
[19:41]  < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zoomviewer/SAL
[19:43]  <   wm-bot> !log tools.zoomviewer <root>  to restart service after a community nudge on IRC
[19:43]  < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zoomviewer/SAL
[19:44]  <    bd808> @MaartenDammers: it is back up and running for the moment. It looked like somebody tried to stop the webservice on 2023-06-15 and then wandered away with the gridengine job stuck in a deletion pending state.

I spent a long time trying to figure out what command to actually run as the service does not have any admin docs I could find and also has several completely different startup scripts in its $HOME directory. In the end I decided webservice start probably was the correct thing in combination with the $HOME/service.template file that adds --backend=gridengine.

I figured out that the tool has a daily cronjob setup to run this script:

restart.sh
#!/bin/bash

webservice --backend=gridengine --release buster stop
sleep 5
webservice --backend=gridengine --release buster lighttpd start

This doesn't explain why grid engine failed to delete the job, but it does explain how a shutdown was initiated and then not followed up on. @dschwen would have been sent an error email from that cronjob every day from 2023-06-15 until today if they had not coded the crontab in such a way that it never notifies anyone on failure.
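One way to make a failed restart visible is to have the script check each step and exit non-zero with a message on stderr. The following is a minimal sketch, not the tool's actual script: the `WEBSERVICE` and `RESTART_DELAY` overrides and the `restart_webservice` name are hypothetical additions, and it assumes that `webservice` returns a non-zero exit status on failure, which nothing in this thread confirms.

```bash
#!/bin/bash
# Hypothetical variant of restart.sh that reports failures instead of ignoring them.
# Assumption: `webservice` exits non-zero when stop/start fails.

# Both variables are illustrative overrides, not part of the original script.
WEBSERVICE="${WEBSERVICE:-webservice}"
RESTART_DELAY="${RESTART_DELAY:-5}"

restart_webservice() {
    # A failed stop is reported but not fatal: the service may simply not be running.
    if ! "$WEBSERVICE" --backend=gridengine --release buster stop; then
        echo "restart: 'webservice stop' failed, trying to start anyway" >&2
    fi
    sleep "$RESTART_DELAY"
    # A failed start is the case that needs a human, so propagate it.
    if ! "$WEBSERVICE" --backend=gridengine --release buster lighttpd start; then
        echo "restart: 'webservice start' failed" >&2
        return 1
    fi
}

# Intended last line of the script (commented out so the file can be sourced):
# restart_webservice || exit 1
```

Cron can only mail what a job prints, so these messages help only if the crontab entry does not discard the script's output.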

Hm, I'm very much open to having a co-admin on this project. My day job and family do not leave me as much time as I'd like. I'll take a quick look now (tricky behind the work firewall).

Viewer works now

https://zoomviewer.toolforge.org/index.php?f=Chicago.jpg&flash=no

The multi resolution pyramid generation might not work yet though

Nope, I see new multi res tiffs popping up in the cache dir right now

Hm, I'm very much open to having a co-admin on this project.

While I can’t dedicate much time to this tool, I could be a backup person for things like restarting the tool if it stops, maybe even improvements in the future like documenting how it works, migrating to Kubernetes or dropping Flash support to make it simpler. So feel free to add me as a co-maintainer (at https://toolsadmin.wikimedia.org/tools/id/zoomviewer). Please also release the source code under a free software license (if you haven’t done so already) so that I and others are legally allowed to improve it.

The thumbnails on Commons don’t work either, so it’s probably a Commons error, which isn’t nicely handled by the tool, but isn’t caused by it either.

Added you as a co-maintainer and added a license (that was an oversight). There is minimal doco in the README.md. I shall expand on that. But TBH I'm just winging it, as I'm neither an expert in lighttpd nor gridengine.

Repo: https://github.com/toollabs/zoomviewer

Kubernetes transition would be nice, but the project requires a fastcgi binary to be compiled (and I didn't quite see how the binary compiled on the login node was guaranteed to run in Kubernetes).

@Yann I'll check what's going on with that image. My guess is that the multiresolution pyramid pipeline failed for that image.

Yeah, the VIPS process fails for that image:

$ /usr/bin/vips im_vips2tiff fc2fd120277fc6a343040d5216118bf2.jpg fc2fd120277fc6a343040d5216118bf2.tif:jpeg:75,tile:256x256,pyramid
Killed
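The bare `Killed` line is what the shell prints when a child process dies from SIGKILL, the signal the kernel's out-of-memory killer sends. A small sketch for telling such a kill apart from an ordinary error exit, relying on the shell convention that death by signal is reported as 128 plus the signal number; `describe_exit` is a hypothetical helper name, and the self-killing `sh -c` child merely stands in for vips:

```bash
#!/bin/bash
# describe_exit STATUS: classify a shell exit status.
describe_exit() {
    local status="$1"
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        # 128 + N means the process was terminated by signal N.
        echo "killed by signal $((status - 128))"
    else
        echo "failed with exit code $status"
    fi
}

# Stand-in for the vips call: a child that sends itself SIGKILL (signal 9).
sh -c 'kill -9 $$' || describe_exit "$?"
```

A status of 137 (128 + 9) after the vips command would be consistent with the process being killed for exceeding a memory limit, although it does not by itself say who sent the signal.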

Added you as a co-maintainer and added a license (that was an oversight). There is minimal doco in the README.md. I shall expand on that. But TBH I'm just winging it, as I'm neither an expert in lighttpd nor gridengine.

Repo: https://github.com/toollabs/zoomviewer

Thanks! I’ve added a toolinfo record, so the license and repo link are now visible at https://toolsadmin.wikimedia.org/tools/id/zoomviewer.

Kubernetes transition would be nice, but the project requires a fastcgi binary to be compiled (and I didn't quite see how the binary compiled on the login node was guaranteed to run in Kubernetes).

I see. I have zero experience with fastcgi, so maybe I’m not the best person to do the transition.

fastcgi is still broken

It's not _still_ broken, it is broken yet again. Let me try kicking the webservice again and removing the restart script...

The webservice was running, but the fastcgi service was still broken. Stopping and starting the service got it going again. No idea why it fails. Maybe it is getting killed (running out of memory?)

When I checked it (before restarting), the fastcgi service (and not the lighttpd proxy) responded with a 404 status and a body along the lines of Unable to open file '...' (I should have noted the exact filename, but I didn’t – in any case, it looked like something that should exist). So it wasn’t totally dead, it just had trouble opening the file. (I think the error occurred here – unfortunately the code doesn’t read and log the errno, so we don’t know why fopen returned a null pointer; and there are quite a few possible causes. Why do they use C functions in C++ code in the first place?…)

Ah, ok, that's good to know. I did pull the latest iipserver repo version and rebuilt the server on Wednesday, but if it is indeed due to missing files then this is likely related to the dying VIPS process which gets killed by gridengine(?) for running out of memory. I suppose I could ask for a bigger allocation. What do you think @bd808 ?

If it was simply due to a missing file, the condition stat(pstr,&sb)==0 at line 102 wouldn’t be fulfilled and thus the control wouldn’t reach line 111. It should be something less obvious.

I bet it's not a missing file, but rather a broken tif. Those can occur when VIPS crashes/is killed

Ah, ok, that's good to know. I did pull the latest iipserver repo version and rebuilt the server on Wednesday, but if it is indeed due to missing files then this is likely related to the dying VIPS process which gets killed by gridengine(?) for running out of memory. I suppose I could ask for a bigger allocation. What do you think @bd808 ?

Yes, you could try configuring the image processing jobs to request more ram. Grid engine certainly is supposed to cap the resources given an individual process. It just does such a bad job of it that we are often surprised when the limits actually kick in.

Still not working. This is especially an issue because of T307787.

The brokenness of this tool made it to WMDE-TechWish's current experimental "We try to help fixing some tools" working mode. And I had a look to figure out if we could help here.

The thumbnails on Commons don’t work either, so it’s probably a Commons error, which isn’t nicely handled by the tool, but isn’t caused by it either.

Some things got fixed in T344233: Some custom generated thumbnails get massivily cropped. There was an issue with thumb generation from big images where the memory limit was hit.

Kubernetes transition would be nice, but the project requires a fastcgi binary to be compiled (and I didn't quite see how the binary compiled on the login node was guaranteed to run in Kubernetes).

Maybe at least some parts could be moved away from the grid engine though, e.g. the jobs pulling the images and doing the tiff conversion. I'm not so sure about /usr/bin/vips that's used there. As far as I can tell it's not in any of the k8s images (though I haven't looked at all of them). But maybe the conversion can be done by ImageMagick?

Also I wonder if we could compile the binary somewhere else and still use it in the k8s environment.

I just saw the comment in T320210#9114857 and maybe that's a good option to allow vips in a custom image for these jobs.

This is weird:

for

https://zoomviewer.toolforge.org/fcgi-bin/iipsrv.fcgi?FIF=cache/779543aa14d92a2dff180a4cbc0eb2f6.tif&obj=IIP,1.0&obj=Max-size&obj=Tile-size&obj=Resolution-number

I get

Unable to open file '/data/project/zoomviewer/public_html/cache/779543aa14d92a2dff180a4cbc0eb2f6.tif'

However on toolforge I can see the file:

tools.zoomviewer@tools-sgebastion-10:~$ l /data/project/zoomviewer/public_html/cache/779543aa14d92a2dff180a4cbc0eb2f6.tif
-rw-r--r-- 1 tools.zoomviewer tools.zoomviewer 12765122 Aug 29 10:18 /data/project/zoomviewer/public_html/cache/779543aa14d92a2dff180a4cbc0eb2f6.tif
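Since the file exists but the server still cannot open it, a quick triage can separate "missing", "unreadable", "empty" and "not a TIFF at all" from the harder cases. This is only a sketch: `diagnose_tiff` is a hypothetical helper, it checks the file as the login user rather than as the fcgi process, and a valid header does not prove the file is complete, so a truncated pyramid would still pass.

```bash
#!/bin/bash
# diagnose_tiff PATH: print one word describing the first obvious problem.
diagnose_tiff() {
    local path="$1" magic
    if [ ! -e "$path" ]; then echo "missing"; return; fi
    if [ ! -r "$path" ]; then echo "unreadable"; return; fi
    if [ ! -s "$path" ]; then echo "empty"; return; fi
    # First four bytes as hex, e.g. 49492a00 for a little-endian TIFF.
    magic="$(head -c 4 "$path" | od -An -tx1 | tr -d ' \n')"
    case "$magic" in
        49492a00|4d4d002a) echo "tiff-header-ok" ;;     # classic TIFF, II or MM byte order
        49492b00|4d4d002b) echo "bigtiff-header-ok" ;;  # BigTIFF
        *) echo "not-tiff" ;;
    esac
}

# Usage (path taken from the listing above):
# diagnose_tiff /data/project/zoomviewer/public_html/cache/779543aa14d92a2dff180a4cbc0eb2f6.tif
```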

In the new job system, vips was killed when it ran with the default limit of 500MB, but it completed when I ran it manually with a 6GB memory limit. So I'll change the memory limit in the source.

I got an error from IIPMooViewer initially, with "tiff open failed" in iipsrv.log, but it worked after a reload, so I assume iipsrv briefly saw a partial TIFF file.
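If iipsrv really did see a half-written TIFF, a common way to close that window is to write the output under a temporary name in the same directory and rename it into place only after the converter succeeds. The sketch below illustrates that pattern and is not what the tool currently does: `convert_atomically` is a hypothetical helper, and it assumes the converter takes the output path as its last argument (the vips call in this thread appends options to the output name, so it would need adapting).

```bash
#!/bin/bash
# convert_atomically FINAL_PATH CONVERTER [ARGS...]
# Runs "CONVERTER ARGS... TMP_PATH" and renames TMP_PATH to FINAL_PATH on success,
# so readers never see FINAL_PATH in a partially written state.
convert_atomically() {
    local final="$1"; shift
    local tmp
    # Same directory as the target, so the rename stays on one filesystem.
    tmp="$(dirname "$final")/.partial.$$.$(basename "$final")"
    if "$@" "$tmp"; then
        mv -f "$tmp" "$final"
    else
        local status=$?
        rm -f "$tmp"        # do not leave a broken file behind
        return "$status"
    fi
}

# Usage with cp standing in for the real converter:
# convert_atomically cache/example.tif cp source.tif
```

The temporary name keeps the original extension, which matters for tools that pick the output format from the file name.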

tstarling claimed this task.

Deployed, purged cache, refreshed. Worked very nicely this time, no errors.

I increased the CPU limit to 1 core (from 0.5).