
Remove option for PDF → DjVu conversion (phetools)
Open, Needs Triage, Public

Description

The tool (Phetools' pdf_to_djvu_cgi.py) that IA Upload uses to convert IA PDFs into DjVu is no longer operational, so we should remove the option from IA Upload.

There is no great advantage to be had from creating DjVus from these PDFs; the original PDFs can be uploaded directly.

The error that's currently occurring a few times a week looks like:

[2024-04-27T08:12:01.987032+00:00] LOG.INFO: Creating DjVu for in.ernet.dli.2015.285133 from Pdf [] []
[2024-04-27T08:12:01.989211+00:00] LOG.INFO: Requesting start of conversion of in.ernet.dli.2015.285133 [] []
[2024-04-27T08:12:02.153205+00:00] LOG.CRITICAL: Server error: `GET http://tools.wmflabs.org/phetools/pdf_to_djvu_cgi.py?cmd=convert&ia_id=in.ernet.dli.2015.285133` resulted in a `503 SERVICE UNAVAILABLE` response: <!DOCTYPE HTML> <html lang="en">   <head>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8">     <m (truncated...)  [] []

In RequestException.php line 113:
                                                                               
  Server error: `GET http://tools.wmflabs.org/phetools/pdf_to_djvu_cgi.py?cmd  
  =convert&ia_id=in.ernet.dli.2015.285133` resulted in a `503 SERVICE UNAVAIL  
  ABLE` response:                                                              
  <!DOCTYPE HTML>                                                              
  <html lang="en">                                                             
    <head>                                                                     
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">      
      <m (truncated...)
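
To gauge how often this failure recurs, the log could be filtered for these lines. The following is a minimal sketch, assuming the `[timestamp] LOG.LEVEL: message` line shape shown in the excerpt above. The function name `failed_conversion_id` and both regular expressions are illustrative and are not part of IA Upload.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed line shape, taken from the excerpt: "[timestamp] CHANNEL.LEVEL: message"
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<channel>\w+)\.(?P<level>\w+): (?P<msg>.*)$"
)
# Pulls the first backtick-quoted "GET <url>" out of a message
GET_URL = re.compile(r"`GET (?P<url>[^`]+)`")


def failed_conversion_id(line):
    """Return the ia_id from a CRITICAL pdf_to_djvu_cgi.py failure line, else None.

    Hypothetical helper for illustration only.
    """
    m = LOG_LINE.match(line)
    if not m or m.group("level") != "CRITICAL":
        return None
    u = GET_URL.search(m.group("msg"))
    if not u or "pdf_to_djvu_cgi.py" not in u.group("url"):
        return None
    # The item identifier travels in the ia_id query parameter
    ids = parse_qs(urlparse(u.group("url")).query).get("ia_id")
    return ids[0] if ids else None


sample = (
    "[2024-04-27T08:12:02.153205+00:00] LOG.CRITICAL: Server error: "
    "`GET http://tools.wmflabs.org/phetools/pdf_to_djvu_cgi.py"
    "?cmd=convert&ia_id=in.ernet.dli.2015.285133` resulted in a "
    "`503 SERVICE UNAVAILABLE` response"
)
print(failed_conversion_id(sample))
```

Running the helper over each line of the log and counting the non-`None` results would give a rough tally of affected items.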

Event Timeline

I agree that, in general, there is little advantage to creating DjVus from PDFs, but some people still prefer DjVu. PDF technology has now subsumed most of the advantages DjVu previously had. Unfortunately, this means PDF is now a very large and complex set of specifications, and it is hard to know how any single PDF is constructed without analysing it with digital tools.

That said, phetools-style conversions are not plagued by issues like T268246, which arise when creating DjVus from JP2 bundles.

On the comment "the original PDFs can be uploaded directly", currently there are enough issues with our handling of PDFs (notably bad text layer extraction -- see T242169 -- and bad thumbnail generation -- see e.g. T224355 and linked issues, also note the related issue T339845) that DjVu is still being recommended over PDF on enWS.

It's also because of these issues that I've stopped using ia-upload to import PDFs and switched to generating DjVus from JP2 bundles (extra images are much easier to deal with because they can be removed before proofreading starts, whereas the issues I linked interfere with the whole proofreading process).