
IA Uploader: randomly corrupted text structure in built DjVu files
Open, Needs Triage, Public

Description

Some pages in some DjVu files built by IA Uploader suffer from a "corrupted text hierarchy". This causes a misalignment between the OCR text and the page image when editing in the Wikisource Page namespace (nsPage), on all following pages.

You can download a corrupted djvu file from this link:
https://upload.wikimedia.org/wikipedia/commons/archive/1/1e/20170907174155%21Folengo_-_Maccheronee%2C_vol_2%2C_1911_-_BEIC_1820192.djvu

You can test for a corrupted structure by running this DjVuLibre command:
djvused -u namefile.djvu -e "output-txt" > dummy.dsed
If the "text hierarchy" is corrupted in one or more pages, you'll get an error.

DjVu files that IA Uploader builds when they are missing from the IA files should be tested, and possibly fixed, before they are uploaded to Commons.

Event Timeline

Alex_brollo updated the task description. Oct 13 2017, 8:16 PM
Samwilson added a subscriber: Samwilson.

Two good ideas here.

  1. Add post-build DjVu validation, e.g. djvused -u namefile.djvu -e "output-txt" > dummy.dsed. I guess if it fails, we just log the fact and give up on the upload.
  2. Prevent this bug from happening in the first place. I'm imagining that this is related to removing the Google cover page?

Even if 2 is fixed, 1 is still worthwhile.
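
A minimal sketch of idea 1, assuming djvused is on the PATH and that it exits with a non-zero status when it meets a corrupted text hierarchy. The function name and the injectable run parameter are hypothetical; the parameter only exists so that the check can be exercised without DjVuLibre installed. This is not the IA Uploader code.

```python
import subprocess


def text_layer_is_valid(djvu_path, run=subprocess.run):
    """Return True if djvused can dump the text layer of every page.

    Assumption: djvused exits with a non-zero status when it hits a
    corrupted text hierarchy.
    """
    result = run(
        ["djvused", "-u", djvu_path, "-e", "select; output-txt"],
        stdout=subprocess.DEVNULL,  # only the exit status matters here
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

An upload step could call this right after the build and, on False, log the failure and skip the upload.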

I think that the issue isn't related to Google page removal: it occurs on random pages, one or more per DjVu file, both "empty" ones and ones containing text.

There's a trick to partly fix the issue with DjVuLibre routines. You can search for individual corrupted pages by running:

djvused -u namefile.djvu -e "select n; output-txt" > dummy.dsed (where n is 1, 2, 3... up to the last DjVu page)

and then, for each corrupted page, running:
djvused -u namefile.djvu -e "select n; remove-txt; save"

The text layer of the corrupted pages will be erased and their OCR lost, but the DjVu file as a whole will work.
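
The two steps above could be automated roughly as follows. This is only a sketch: it assumes djvused is on the PATH, treats any non-zero exit status from output-txt as "corrupted page", and uses the djvused command n to get the page count. All function names and the injectable run parameter are hypothetical.

```python
import subprocess


def djvused(djvu_path, commands, run=subprocess.run):
    """Run djvused with a command string; return (exit code, stdout bytes)."""
    result = run(
        ["djvused", "-u", djvu_path, "-e", commands],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode, result.stdout


def page_count(djvu_path, run=subprocess.run):
    # The djvused command "n" prints the number of pages in the document.
    _, out = djvused(djvu_path, "n", run)
    return int(out.decode().strip())


def corrupted_pages(djvu_path, run=subprocess.run):
    """Return the 1-indexed pages whose text layer djvused cannot dump."""
    bad = []
    for page in range(1, page_count(djvu_path, run) + 1):
        code, _ = djvused(djvu_path, "select %d; output-txt" % page, run)
        if code != 0:
            bad.append(page)
    return bad


def remove_text_layers(djvu_path, pages, run=subprocess.run):
    """Erase the text layer of the given pages, saving the file in place."""
    for page in pages:
        djvused(djvu_path, "select %d; remove-txt; save" % page, run)
```

Calling remove_text_layers(path, corrupted_pages(path)) modifies the file in place, so it is safer to run it on a copy.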

This is the interactive script I use to fix corrupted djvu files:

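#!/usr/bin/env python3
"""Interactively find and fix DjVu pages with a corrupted text layer."""

import subprocess
import sys


def run_djvused(f, commands):
    # Run djvused on file f and return its exit code.
    # Standard output (the text dump) is discarded; only the code matters.
    # subprocess is used instead of os.system because os.system does not
    # return the plain exit code on every platform.
    result = subprocess.run(['djvused', '-u', f, '-e', commands],
                            stdout=subprocess.DEVNULL)
    return result.returncode


def djvu_fix(f, n):
    # f: djvu file name
    # n: number of djvu pages (it could also be obtained with DjVuLibre routines)
    # The script is for interactive use; edit it if automation is needed.
    n = int(n)
    error = run_djvused(f, 'select; output-txt')
    if error == 0:
        print("Text layer OK")
    elif error == 10:
        print("Error 10, text layer probably corrupted; testing pages")
        for i in range(1, n + 1):
            error = run_djvused(f, 'select %d; output-txt' % i)
            if error == 10:
                print("Corrupted page:", i)
                ok = input("Fix (Y/N)? ")
                if ok == "Y":
                    # remove-txt erases the page's text layer; save writes the file.
                    error = run_djvused(f, 'select %d; remove-txt; save' % i)
                    if error == 0:
                        print("Ok, page", i, "fixed")
                    else:
                        print("Fixing failed for page:", i)
    else:
        print(error)


def main(params):
    # params: [script name, djvu file name, number of pages]
    djvu_fix(params[1], params[2])


if __name__ == "__main__":
    main(sys.argv)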

Some more details about this bug; please download https://upload.wikimedia.org/wikipedia/commons/archive/1/1e/20170907174155%21Folengo_-_Maccheronee%2C_vol_2%2C_1911_-_BEIC_1820192.djvu as "folengo.djvu" to repeat tests.

  1. if you try djvused -u folengo.djvu -e "select; output-txt" >dummy.dsed you'll get a "corrupted text hierarchy" error
  2. if you try djvused -u folengo.djvu -e "select 297; output-txt" >dummy.dsed you'll get a "corrupted text hierarchy" error, since 297 is one of the bugged pages
  3. if you open folengo.djvu with DjViewer and go to page 297 with the "display text" option, you won't see any text for the page
  4. if you try djvutxt -page=297 folengo.djvu dummy.txt you won't get an error; you'll simply get no text.
  5. if you try djvutxt -page=298 folengo.djvu dummy.txt you'll get the right text output for page 298.

Point 4 is surprising, since the ProofreadPage extension gives the wrong text for page 298. I presume that the extension doesn't use the DjVuLibre djvutxt routine to extract text from the DjVu file. What does the extension use to extract text?

Is there any danger in blindly removing text from pages that return an error code of 10? i.e. just looping through the whole work and running djvused -u file.djvu -e "select x; remove-txt; save" on the corrupt pages? You're doing it interactively; is there something that makes you abort the process sometimes?

Samwilson added a subscriber: Tpt. Oct 31 2017, 8:02 AM

@Alex_brollo I have got this working. Can you confirm that in your example item 019FolengoLeMaccheronee2Si115 the following pages were the ones needing fixing?

[2017-10-31 15:58:18] LOG.INFO: Validating text layer of DjVu [] []
[2017-10-31 15:58:25] LOG.INFO: Fixing page 294 (1-indexed) [] []
[2017-10-31 15:58:25] LOG.INFO: Fixing page 297 (1-indexed) [] []
[2017-10-31 15:58:25] LOG.INFO: Fixing page 301 (1-indexed) [] []
[2017-10-31 15:58:25] LOG.INFO: Fixing page 302 (1-indexed) [] []
[2017-10-31 15:58:26] LOG.INFO: Fixing page 308 (1-indexed) [] []
[2017-10-31 15:58:26] LOG.INFO: Fixing page 315 (1-indexed) [] []
[2017-10-31 15:58:28] LOG.INFO: Validation complete [] []

I'm not sure what ProofreadPage is using to extract the text; @Tpt will tell us? :-)

I've updated https://tools.wmflabs.org/ia-upload/ with the above fix; see what you think.

Yes. I can't check the list page by page right now, but the total number of bad pages and some of the page numbers look familiar to me.

Is there any danger in blindly removing text from pages that return an error code of 10? i.e. just looping through the whole work and running djvused -u file.djvu -e "select x; remove-txt; save" on the corrupt pages? You're doing it interactively; is there something that makes you abort the process sometimes?

I always write interactive Python scripts at first; I'm a very poor "programmer". I have run the script only about 10 times; it failed once, probably because of an out-of-memory issue with a very large DjVu file (I work on a very basic PC). Whenever the script ran successfully, it never failed to fix the corrupted pages.

Samwilson closed this task as Resolved. Nov 29 2017, 12:34 AM
MusikAnimal moved this task from Untriaged to Archive on the Community-Tech board. Dec 12 2017, 8:03 PM
Alex_brollo reopened this task as Open. Dec 22 2017, 8:02 AM

I re-opened this ticket because of good news.

The "corrupted text hierarchy" bug comes from some "toxic code" in _djvu.xml:

<WORD coords="0,0,0,0,0"> ..... </WORD>

or XML blocks without any WORD descendant.

The new version of xml2dsed.py detects and resolves both issues.
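
For illustration only, here is one way such a clean-up could look. It is not the xml2dsed.py code. It assumes that the text blocks in _djvu.xml are LINE, PARAGRAPH, REGION and PAGECOLUMN elements containing WORD elements; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: these are the container tags that hold WORD elements in _djvu.xml.
CONTAINERS = ("LINE", "PARAGRAPH", "REGION", "PAGECOLUMN")


def is_zero_coords(word):
    """True if every number in the WORD's coords attribute is 0."""
    parts = word.get("coords", "").split(",")
    return all(p.strip() == "0" for p in parts)


def clean_hidden_text(root):
    """Drop zero-coordinate WORDs, then containers with no WORD descendant.

    Returns the number of removed elements.
    """
    removed = 0
    # Pass 1: remove WORD elements such as <WORD coords="0,0,0,0,0">.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "WORD" and is_zero_coords(child):
                parent.remove(child)
                removed += 1
    # Pass 2: innermost containers first, so emptiness propagates upwards.
    for tag in CONTAINERS:
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag == tag and child.find(".//WORD") is None:
                    parent.remove(child)
                    removed += 1
    return removed
```

The tree could be loaded with ET.parse, cleaned with clean_hidden_text(tree.getroot()) and written back with tree.write before the text layer is converted.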

TBolliger moved this task from Archive to Untriaged on the Community-Tech board. Jan 31 2018, 12:49 AM
Samwilson removed Samwilson as the assignee of this task. Feb 27 2018, 1:50 AM

I'm unassigning this only because I'm not actually working on it at the moment, not because I don't care about it. :)