
Memory issues for IA-upload when converting large files to djvu
Closed, Resolved · Public · 3 Estimated Story Points

Description

Uploading and converting the work "Armagh clergy and parishes" failed after the individual pages had been converted:

Queue entry: armaghclergypari00lesl, Armagh clergy and parishes.djvu, status: In progress

log ...

[2017-03-18 09:50:19] LOG.INFO: Creating DjVu for armaghclergypari00lesl from Jp2 [] []
[2017-03-18 09:50:19] LOG.INFO: Saving IA metadata to /mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/armaghclergypari00lesl/metadata.json [] []
[2017-03-18 09:50:20] LOG.INFO: Downloading armaghclergypari00lesl/armaghclergypari00lesl_djvu.xml [] []
[2017-03-18 09:50:25] LOG.INFO: Downloading armaghclergypari00lesl/armaghclergypari00lesl_jp2.zip [] []
[2017-03-18 09:52:30] LOG.INFO: Unzipping /mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/armaghclergypari00lesl/armaghclergypari00lesl_jp2.zip [] []
[2017-03-18 09:53:19] LOG.DEBUG: Zip file extracted to /mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/armaghclergypari00lesl/armaghclergypari00lesl_jp2 [] []
[2017-03-18 09:53:19] LOG.INFO: Processing JP2 files [] []
[2017-03-18 09:53:19] LOG.INFO: Converting 518 individual JP2s to DjVus [] []

...

[2017-03-18 10:21:08] LOG.INFO: Merging all DjVu files to /mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/armaghclergypari00lesl/armaghclergypari00lesl.djvu [] []
[2017-03-18 10:21:30] LOG.INFO: Modifying DjVu XML file /mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/armaghclergypari00lesl/armaghclergypari00lesl_djvu.xml to add /mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/armaghclergypari00lesl/armaghclergypari00lesl.djvu [] []
[2017-03-18 10:21:33] LOG.INFO: Merging modified XML into full DjVu file [] []

@Samwilson said that it was a memory issue.

Event Timeline

Restricted Application added a subscriber: Aklapper.

It's looking like items with more than about 400 pages (the one mentioned here is 518 pages) run into this problem. The tool correctly converts each individual page and then merges them into a single DjVu file, but when it comes time to update the OCR data from the XML, it fails with a guru meditation error.

DannyH triaged this task as Medium priority. Apr 4 2017, 11:31 PM

Let's up the memory to 3 gigs and put a 500 page limit on the script.

kaldari lowered the priority of this task from Medium to Low. Apr 4 2017, 11:33 PM
kaldari set the point value for this task to 3.
kaldari moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

I've increased the memory allowance to 3 GB and File:Armagh clergy and parishes.djvu has been uploaded. There will be others that will still fail; I'll keep an eye on the queue and see what the page count is.

didn't have a text layer once uploaded to Commons :-(

Maybe we could put in a sanity check to prevent people from trying to upload files that are above a certain really large size to keep people from running into the memory limit (at least for now).

Samwilson moved this task from Ready to Needs Review/Feedback on the Community-Tech-Sprint board.

Some recent *jp2.zip file sizes:

I've added a warning (GH37) for when the item is over 600 MB or 800 pages. It doesn't prohibit the upload, because the job seems to run out of memory more often while processing the scans (converting and resizing to JPG) or while merging the OCR XML. Both of those things happen later, so we can't warn the user about them at job-creation time.
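For anyone reading along, here is a rough sketch of the kind of non-blocking check described above. It is not the actual IA-upload code: the function and constant names are made up for illustration, it is written in Python purely as an example, and it assumes "600 MB" means 600 × 1024 × 1024 bytes.

```python
from typing import List

# Thresholds taken from the comment above; the names are hypothetical.
MAX_ZIP_BYTES = 600 * 1024 * 1024  # assumption: MB read as binary megabytes
MAX_PAGES = 800


def size_warnings(zip_bytes: int, page_count: int) -> List[str]:
    """Return warning messages for an item; an empty list means no warning.

    The check never rejects the job. It only produces text that could be
    shown to the user at job-creation time.
    """
    warnings = []
    if zip_bytes > MAX_ZIP_BYTES:
        size_mb = zip_bytes / (1024 * 1024)
        warnings.append(
            f"The JP2 zip is {size_mb:.0f} MB (over 600 MB); "
            "conversion may run out of memory."
        )
    if page_count > MAX_PAGES:
        warnings.append(
            f"The item has {page_count} pages (over {MAX_PAGES}); "
            "conversion may run out of memory."
        )
    return warnings


if __name__ == "__main__":
    # The 518-page item from this task stays under the page threshold.
    for message in size_warnings(zip_bytes=300 * 1024 * 1024, page_count=518):
        print("Warning:", message)
```

Returning a list of messages rather than raising keeps the check advisory, which matches the decision not to prohibit the upload.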

Maybe this will be sufficient.

I'll encourage people to report bugs when they find hung jobs.

PR37 deployed.

I think this can be closed; the other memory-related issues are being tracked separately, or if they're not then specific issues will be opened for them.